pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Model support for `doctr_det_predictor` with Torch_XLA2 #8123

Open ManfeiBai opened 1 month ago

ManfeiBai commented 1 month ago

Fix the model test for `doctr_det_predictor.py`.

  1. Set up the environment according to "Run a model under torch_xla2".
  2. Run the model test under `run_torchbench/` with `python models/your_target_model_name.py`.
  3. Fix the failure.

Please refer to this guide when fixing the failure:

Also refer to these PRs:

barney-s commented 3 weeks ago

The first run failed because `tf2onnx` was missing, so I did a `pip install tf2onnx`. The second run failed again, this time inside doctr's TensorFlow code path:

barni@barni ~/workspace/pytorch-tpu/run_torchbench
 % JAX_PLATFORMS=cpu python models/doctr_det_predictor.py
/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:337: UserWarning: Device capability of jax unspecified, assuming `cpu` and `cuda`. Please specify it via the `devices` argument of `register_backend`.
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/google/home/barni/workspace/pytorch-tpu/run_torchbench/models/doctr_det_predictor.py", line 61, in <module>
    sys.exit(main())
  File "/usr/local/google/home/barni/workspace/pytorch-tpu/run_torchbench/models/doctr_det_predictor.py", line 19, in main
    module = importlib.import_module(model_name)
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/google/home/barni/workspace/pytorch-tpu/run_torchbench/benchmark/torchbenchmark/models/doctr_det_predictor/__init__.py", line 5, in <module>
    from doctr.models import ocr_predictor
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/__init__.py", line 1, in <module>
    from . import io, models, datasets, contrib, transforms, utils
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/models/__init__.py", line 1, in <module>
    from .classification import *
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/models/classification/__init__.py", line 1, in <module>
    from .mobilenet import *
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/models/classification/mobilenet/__init__.py", line 4, in <module>
    from .tensorflow import *
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/models/classification/mobilenet/tensorflow.py", line 15, in <module>
    from ....datasets import VOCABS
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/datasets/__init__.py", line 3, in <module>
    from .generator import *
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/datasets/generator/__init__.py", line 4, in <module>
    from .tensorflow import *
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/datasets/generator/tensorflow.py", line 8, in <module>
    from .base import _CharacterGenerator, _WordGenerator
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/datasets/generator/base.py", line 14, in <module>
    from ..datasets import AbstractDataset
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/datasets/datasets/__init__.py", line 4, in <module>
    from .tensorflow import *
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/datasets/datasets/tensorflow.py", line 15, in <module>
    from .base import _AbstractDataset, _VisionDataset
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/datasets/datasets/base.py", line 16, in <module>
    from ...models.utils import _copy_tensor
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/models/utils/__init__.py", line 4, in <module>
    from .tensorflow import *
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/models/utils/tensorflow.py", line 10, in <module>
    import tf2onnx
ModuleNotFoundError: No module named 'tf2onnx'
barni@barni ~/workspace/pytorch-tpu/run_torchbench
 % pip install tf2onnx
Collecting tf2onnx
  Downloading tf2onnx-1.16.1-py3-none-any.whl.metadata (1.3 kB)
Requirement already satisfied: numpy>=1.14.1 in /usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages (from tf2onnx) (2.0.2)
Requirement already satisfied: onnx>=1.4.1 in /usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages (from tf2onnx) (1.17.0)
Requirement already satisfied: requests in /usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages (from tf2onnx) (2.32.3)
Requirement already satisfied: six in /usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages (from tf2onnx) (1.16.0)
Requirement already satisfied: flatbuffers>=1.12 in /usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages (from tf2onnx) (24.3.25)
Collecting protobuf~=3.20 (from tf2onnx)
  Using cached protobuf-3.20.3-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (679 bytes)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages (from requests->tf2onnx) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages (from requests->tf2onnx) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages (from requests->tf2onnx) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages (from requests->tf2onnx) (2024.8.30)
Downloading tf2onnx-1.16.1-py3-none-any.whl (455 kB)
Using cached protobuf-3.20.3-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Installing collected packages: protobuf, tf2onnx
  Attempting uninstall: protobuf
    Found existing installation: protobuf 5.28.3
    Uninstalling protobuf-5.28.3:
      Successfully uninstalled protobuf-5.28.3
Successfully installed protobuf-3.20.3 tf2onnx-1.16.1
barni@barni ~/workspace/pytorch-tpu/run_torchbench
 % JAX_PLATFORMS=cpu python models/doctr_det_predictor.py
/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:337: UserWarning: Device capability of jax unspecified, assuming `cpu` and `cuda`. Please specify it via the `devices` argument of `register_backend`.
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/google/home/barni/workspace/pytorch-tpu/run_torchbench/models/doctr_det_predictor.py", line 61, in <module>
    sys.exit(main())
  File "/usr/local/google/home/barni/workspace/pytorch-tpu/run_torchbench/models/doctr_det_predictor.py", line 21, in main
    benchmark = benchmark_cls(test="eval", device = "cpu")
  File "/usr/local/google/home/barni/workspace/pytorch-tpu/run_torchbench/benchmark/torchbenchmark/util/model.py", line 43, in __call__
    obj = type.__call__(cls, *args, **kwargs)
  File "/usr/local/google/home/barni/workspace/pytorch-tpu/run_torchbench/benchmark/torchbenchmark/models/doctr_det_predictor/__init__.py", line 22, in __init__
    predictor = ocr_predictor(
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/models/zoo.py", line 114, in ocr_predictor
    return _predictor(
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/models/zoo.py", line 32, in _predictor
    det_predictor = detection_predictor(
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/models/detection/zoo.py", line 103, in detection_predictor
    return _predictor(arch, pretrained, assume_straight_pages, **kwargs)
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/models/detection/zoo.py", line 50, in _predictor
    _model = detection.__dict__[arch](
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/models/detection/differentiable_binarization/tensorflow.py", line 390, in db_resnet50
    return _db_resnet(
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/models/detection/differentiable_binarization/tensorflow.py", line 301, in _db_resnet
    feat_extractor = IntermediateLayerGetter(
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/models/utils/tensorflow.py", line 134, in __init__
    intermediate_fmaps = [model.get_layer(layer_name).get_output_at(0) for layer_name in layer_names]
  File "/usr/local/google/home/barni/miniconda3/envs/diffusion-models-2/lib/python3.10/site-packages/doctr/models/utils/tensorflow.py", line 134, in <listcomp>
    intermediate_fmaps = [model.get_layer(layer_name).get_output_at(0) for layer_name in layer_names]
AttributeError: 'Activation' object has no attribute 'get_output_at'
barni@barni ~/workspace/pytorch-tpu/run_torchbench
 % 
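Both tracebacks above pass through doctr's `tensorflow.py` modules, and the `pip install` step replaced protobuf 5.28.3 with 3.20.3, so it may help to record what the environment actually contains before digging further. The following is a minimal sketch, not part of the benchmark code: `describe_env` is a hypothetical helper name, and it assumes doctr chooses its backend from the `USE_TF` / `USE_TORCH` environment variables, which should be verified against the installed doctr version.

```python
import os
from importlib import metadata


def describe_env(packages=("torch", "tensorflow", "keras", "tf2onnx", "protobuf")):
    """Collect backend-related env vars and installed package versions.

    Assumption: doctr reads USE_TF / USE_TORCH to pick a backend.
    A version of None means the distribution is not installed.
    """
    report = {
        "USE_TF": os.environ.get("USE_TF"),
        "USE_TORCH": os.environ.get("USE_TORCH"),
        "versions": {},
    }
    for name in packages:
        try:
            report["versions"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report["versions"][name] = None
    return report


if __name__ == "__main__":
    info = describe_env()
    print("USE_TF    =", info["USE_TF"])
    print("USE_TORCH =", info["USE_TORCH"])
    for pkg, version in info["versions"].items():
        print(f"{pkg}: {version or 'not installed'}")
```

If the assumption about the environment variables holds and both TensorFlow and PyTorch are installed, rerunning the test with `USE_TORCH=1` set might steer doctr to its PyTorch modules; the `get_output_at` error may also point to a Keras/TensorFlow version that doctr's TensorFlow path does not expect.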
ManfeiBai commented 3 weeks ago

Thanks for investigating. It looks like this failure is not related to the torch_xla2 implementation.

Since the PyTorch benchmarking dashboard no longer maintains this model, let's skip this model too: https://hud.pytorch.org/benchmark/torchbench/inductor_no_cudagraphs?dashboard=TorchInductor&startTime=Sun,%2028%20Jan%202024%2000:00:00%20GMT&stopTime=Sun,%2004%20Feb%202024%2000:00:00%20GMT&granularity=hour&mode=training&model=doctr_det_predictor&dtype=amp&deviceName=cuda%20(a100)&lBranch=main&lCommit=578b8d75e5220a8bad3b4c94e3385f9bf721c1dc&rBranch=main&rCommit=25b2e4657308ae4b1508f260a83f81ba155885bf