open-mmlab / mmdeploy

OpenMMLab Model Deployment Framework
https://mmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Jetson Orin model convert fail #2396

Closed zhanghui-china closed 1 year ago

zhanghui-china commented 1 year ago

Describe the bug

When I execute cuda_resnet50.sh, which runs:

python tools/deploy.py \
    configs/mmpretrain/classification_tensorrt-int8_static-224x224.py \
    /home1/zhanghui/mmpretrain/configs/resnet/resnet50_8xb32_in1k.py \
    /home1/zhanghui/resnet50_batch256_imagenet_20200708-cfb998bf.pth \
    /home1/zhanghui/mmpretrain/demo/demo.JPEG \
    --work-dir mmdeploy_models/mmpretrain/resnet50 \
    --device cuda:0 \
    --dump-info

it fails with: FileNotFoundError: [Errno 2] No such file or directory: 'data/imagenet/val'

Reproduction

export MMDEPLOY_DIR=/home1/zhanghui/mmdeploy
cd ${MMDEPLOY_DIR}
sh ./cuda_resnet50.sh

Environment

NVIDIA Jetson Orin 32G
JetPack 5.1.1
CUDA 11.4.215
cuDNN 8.6.0.166
Archiconda Python 3.8.13
PyTorch 1.11
torchvision 0.12.0
TensorRT 8.5.2

Error traceback

(mmdeploy3.8) zhanghui@ubuntu:/home1/zhanghui/torchvision$ cd ${MMDEPLOY_DIR}
(mmdeploy3.8) zhanghui@ubuntu:/home1/zhanghui/mmdeploy$ sh ./cuda_resnet50.sh
/home/zhanghui/archiconda3/envs/mmdeploy3.8/lib/python3.8/site-packages/torchvision-0.12.0-py3.8-linux-aarch64.egg/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libjpeg.so.9: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
09/01 18:36:28 - mmengine - WARNING - Failed to search registry with scope "mmpretrain" in the "Codebases" registry tree. As a workaround, the current "Codebases" registry in "mmdeploy" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmpretrain" is a correct scope, or whether the registry is initialized.
09/01 18:36:28 - mmengine - WARNING - Failed to search registry with scope "mmpretrain" in the "mmpretrain_tasks" registry tree. As a workaround, the current "mmpretrain_tasks" registry in "mmdeploy" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmpretrain" is a correct scope, or whether the registry is initialized.
09/01 18:36:31 - mmengine - INFO - Start pipeline mmdeploy.apis.pytorch2onnx.torch2onnx in subprocess
/home/zhanghui/archiconda3/envs/mmdeploy3.8/lib/python3.8/site-packages/torchvision-0.12.0-py3.8-linux-aarch64.egg/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libjpeg.so.9: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
09/01 18:36:32 - mmengine - WARNING - Failed to search registry with scope "mmpretrain" in the "Codebases" registry tree. As a workaround, the current "Codebases" registry in "mmdeploy" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmpretrain" is a correct scope, or whether the registry is initialized.
09/01 18:36:32 - mmengine - WARNING - Failed to search registry with scope "mmpretrain" in the "mmpretrain_tasks" registry tree. As a workaround, the current "mmpretrain_tasks" registry in "mmdeploy" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmpretrain" is a correct scope, or whether the registry is initialized.
Loads checkpoint by local backend from path: /home1/zhanghui/resnet50_batch256_imagenet_20200708-cfb998bf.pth
09/01 18:36:34 - mmengine - WARNING - DeprecationWarning: get_onnx_config will be deprecated in the future.
09/01 18:36:34 - mmengine - INFO - Export PyTorch model to ONNX: mmdeploy_models/mmpretrain/resnet50/end2end.onnx.
09/01 18:36:34 - mmengine - WARNING - Can not find torch._C._jit_pass_onnx_autograd_function_process, function rewrite will not be applied
09/01 18:36:34 - mmengine - WARNING - Can not find torch._C._jit_pass_onnx_deduplicate_initializers, function rewrite will not be applied
09/01 18:36:41 - mmengine - INFO - Execute onnx optimize passes.
09/01 18:36:42 - mmengine - INFO - Finish pipeline mmdeploy.apis.pytorch2onnx.torch2onnx
09/01 18:36:45 - mmengine - INFO - Start pipeline mmdeploy.apis.calibration.create_calib_input_data in subprocess
/home/zhanghui/archiconda3/envs/mmdeploy3.8/lib/python3.8/site-packages/torchvision-0.12.0-py3.8-linux-aarch64.egg/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libjpeg.so.9: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
09/01 18:36:46 - mmengine - WARNING - Failed to search registry with scope "mmpretrain" in the "Codebases" registry tree. As a workaround, the current "Codebases" registry in "mmdeploy" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmpretrain" is a correct scope, or whether the registry is initialized.
09/01 18:36:46 - mmengine - WARNING - Failed to search registry with scope "mmpretrain" in the "mmpretrain_tasks" registry tree. As a workaround, the current "mmpretrain_tasks" registry in "mmdeploy" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmpretrain" is a correct scope, or whether the registry is initialized.
Loads checkpoint by local backend from path: /home1/zhanghui/resnet50_batch256_imagenet_20200708-cfb998bf.pth
Process Process-3:
Traceback (most recent call last):
  File "/home/zhanghui/archiconda3/envs/mmdeploy3.8/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/zhanghui/archiconda3/envs/mmdeploy3.8/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home1/zhanghui/mmdeploy/mmdeploy/apis/core/pipeline_manager.py", line 107, in __call__
    ret = func(*args, **kwargs)
  File "/home1/zhanghui/mmdeploy/mmdeploy/apis/calibration.py", line 59, in create_calib_input_data
    dataset = task_processor.build_dataset(calib_dataloader['dataset'])
  File "/home1/zhanghui/mmdeploy/mmdeploy/codebase/base/task.py", line 156, in build_dataset
    dataset = DATASETS.build(dataset_cfg)
  File "/home1/zhanghui/mmengine-0.8.4/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/home1/zhanghui/mmengine-0.8.4/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/home1/zhanghui/mmpretrain/mmpretrain/datasets/imagenet.py", line 122, in __init__
    super().__init__(
  File "/home1/zhanghui/mmpretrain/mmpretrain/datasets/custom.py", line 219, in __init__
    self.full_init()
  File "/home1/zhanghui/mmpretrain/mmpretrain/datasets/base_dataset.py", line 178, in full_init
    super().full_init()
  File "/home1/zhanghui/mmengine-0.8.4/mmengine/dataset/base_dataset.py", line 296, in full_init
    self.data_list = self.load_data_list()
  File "/home1/zhanghui/mmpretrain/mmpretrain/datasets/custom.py", line 264, in load_data_list
    samples = self._find_samples()
  File "/home1/zhanghui/mmpretrain/mmpretrain/datasets/custom.py", line 224, in _find_samples
    classes, folder_to_idx = find_folders(self.img_prefix)
  File "/home1/zhanghui/mmpretrain/mmpretrain/datasets/custom.py", line 31, in find_folders
    folders = list(
  File "/home1/zhanghui/mmengine-0.8.4/mmengine/fileio/backends/local_backend.py", line 527, in _list_dir_or_file
    for entry in os.scandir(dir_path):
FileNotFoundError: [Errno 2] No such file or directory: 'data/imagenet/val'
09/01 18:36:48 - mmengine - ERROR - /home1/zhanghui/mmdeploy/mmdeploy/apis/core/pipeline_manager.py - pop_mp_output - 80 - `mmdeploy.apis.calibration.create_calib_input_data` with Call id: 1 failed. exit.
RunningLeon commented 1 year ago

Hi, for TensorRT int8 you need to change val_dataloader in the model config so that it points to your calibration dataset.
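
A minimal sketch of that change, assuming the model config exposes val_dataloader as a nested dict with a dataset.data_root key (the 'data/imagenet/val' path in the traceback suggests the default root is the relative path data/imagenet). The helper name with_calib_root and the /path/to/imagenet value are hypothetical, and plain dicts stand in for the real config object:

```python
from copy import deepcopy


def with_calib_root(model_cfg: dict, data_root: str) -> dict:
    """Return a copy of model_cfg whose val dataset reads from data_root."""
    cfg = deepcopy(model_cfg)  # leave the caller's config untouched
    cfg["val_dataloader"]["dataset"]["data_root"] = data_root
    return cfg


# Stand-in for the relevant part of resnet50_8xb32_in1k.py (assumed layout).
model_cfg = {
    "val_dataloader": {
        "batch_size": 32,
        "dataset": {
            "type": "ImageNet",
            "data_root": "data/imagenet",  # relative path, as in the error
            "split": "val",
        },
    }
}

# Hypothetical location of the images to be used for int8 calibration.
patched = with_calib_root(model_cfg, "/path/to/imagenet")
print(patched["val_dataloader"]["dataset"]["data_root"])
```

In the actual config file the same effect would come from editing data_root under val_dataloader directly, so that the val folder beneath it exists on the Jetson.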

zhanghui-china commented 1 year ago

Please tell me how to do that.

RunningLeon commented 1 year ago

Hi, please refer to this doc: https://mmdeploy.readthedocs.io/en/latest/05-supported-backends/tensorrt.html#int8-support
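
Since the FileNotFoundError in the traceback comes from a relative path ('data/imagenet/val') resolved against the directory where deploy.py is launched, a quick pre-flight check may help before rerunning. This is only a sketch: calib_dir_status is a hypothetical helper, not part of mmdeploy, and it assumes the dataset layout is a data root with a val subfolder.

```python
import os
import tempfile


def calib_dir_status(data_root: str, subdir: str = "val", cwd: str = ".") -> tuple:
    """Resolve data_root/subdir against cwd and report whether it is a directory."""
    # An absolute data_root overrides cwd, matching how os.path.join behaves.
    path = os.path.abspath(os.path.join(cwd, data_root, subdir))
    return path, os.path.isdir(path)


# Demo in a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as work_dir:
    # Before the folder exists, the check reports it as missing.
    _, found_before = calib_dir_status("data/imagenet", cwd=work_dir)
    os.makedirs(os.path.join(work_dir, "data", "imagenet", "val"))
    # After creating it, the same check succeeds.
    path, found_after = calib_dir_status("data/imagenet", cwd=work_dir)

print(found_before, found_after)
```

Calling calib_dir_status("data/imagenet", cwd=os.environ["MMDEPLOY_DIR"]) on the device would show whether the default path is reachable from the directory in which the script runs.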

github-actions[bot] commented 1 year ago

This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.

github-actions[bot] commented 1 year ago

This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.