open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Running the compile example with PyTorch 2.1.0 raises an error #11154

Open DonggeunYu opened 11 months ago

DonggeunYu commented 11 months ago


Checklist

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.
  3. The bug has not been fixed in the latest version.

Describe the bug

I ran the command below to test torch.compile with PyTorch 2.1.0 and got an error. The same command worked fine with PyTorch 2.0.1.

Reproduction

  1. What command or script did you run?
python tools/train.py configs/rtmdet/rtmdet_s_8xb32-300e_coco.py  --cfg-options compile=True

Reference: https://github.com/open-mmlab/mmdetection/blob/main/docs/en/notes/faq.md

  2. Did you make any modifications to the code or config? Did you understand what you have modified? Answer: No
  3. What dataset did you use? Answer: COCO2017

Environment

sys.platform: linux
Python: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: Tesla V100-SXM3-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
PyTorch: 2.1.0+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.16.0+cu121
OpenCV: 4.8.1
MMEngine: 0.9.1
MMDetection: 3.2.0+fe3f809

Error traceback

11/10 05:59:48 - mmengine - INFO - load backbone. in model from: https://download.openmmlab.com/mmdetection/v3.0/rtmdet/cspnext_rsb_pretrain/cspnext-tiny_imagenet_600e.pth
Loads checkpoint by http backend from path: https://download.openmmlab.com/mmdetection/v3.0/rtmdet/cspnext_rsb_pretrain/cspnext-tiny_imagenet_600e.pth
11/10 05:59:53 - mmengine - INFO - Model has been "compiled". The first few iterations will be slow, please be patient.
11/10 05:59:53 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
11/10 05:59:53 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
11/10 05:59:53 - mmengine - INFO - Checkpoints will be saved to /new_mmdet/work_dirs/rtmdet_tiny_8xb32-300e_coco.
Traceback (most recent call last):
  File "/new_mmdet/train.py", line 122, in <module>
    main()
  File "/new_mmdet/train.py", line 118, in main
    runner.train()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1777, in train
    model = self.train_loop.run()  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 96, in run
    self.run_epoch()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 112, in run_epoch
    self.run_iter(idx, data_batch)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 128, in run_iter
    outputs = self.runner.model.train_step(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/model/base_model/base_model.py", line 112, in train_step
    with optim_wrapper.optim_context(self):
  File "/usr/local/lib/python3.10/dist-packages/mmengine/model/base_model/base_model.py", line 113, in <resume in train_step>
    data = self.data_preprocessor(data, True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/mmdet/models/data_preprocessors/data_preprocessor.py", line 121, in forward
    batch_pad_shape = self._get_pad_shape(data)
  File "/usr/local/lib/python3.10/dist-packages/mmdet/models/data_preprocessors/data_preprocessor.py", line 122, in <resume in forward>
    data = super().forward(data=data, training=training)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/model/base_model/data_preprocessor.py", line 246, in forward
    data = self.cast_data(data)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 641, in _convert_frame
    result = inner_convert(frame, cache_size, hooks, frame_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 389, in _convert_frame_assert
    return _compile(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 569, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 491, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
    transformations(instructions, code_options)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 458, in transform
    tracer.run()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 2074, in run
    super().run()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
    and self.step()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
    getattr(self, inst.opname)(inst)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
    return inner_fn(self, inst)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 1115, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 562, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/functions.py", line 261, in call_function
    return super().call_function(tx, args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 598, in inline_user_function_return
    result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 2179, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 2286, in inline_call_
    tracer.run()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
    and self.step()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
    getattr(self, inst.opname)(inst)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 279, in inner
    example_value=get_fake_value(scalar_to_tensor_proxy.node, self),
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1376, in get_fake_value
    raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1337, in get_fake_value
    return wrap_fake_exception(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 916, in wrap_fake_exception
    return fn()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1338, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1410, in run_node
    raise RuntimeError(fn_str + str(e)).with_traceback(e.__traceback__) from e
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1397, in run_node
    return node.target(*args, **kwargs)
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in method scalar_tensor of type object at 0x7f4786581d80>(*([FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640)), FakeTensor(..., device='cuda:0', size=(3, 640, 640))],), **{}):
scalar_tensor(): argument 's' (position 1) must be Number, not immutable_list

from user code:
   File "/usr/local/lib/python3.10/dist-packages/mmengine/model/base_model/data_preprocessor.py", line 269, in <resume in forward>
    batch_inputs = stack_batch(batch_inputs, self.pad_size_divisor,
  File "/usr/local/lib/python3.10/dist-packages/mmengine/model/utils.py", line 36, in stack_batch
    assert tensor_list, '`tensor_list` could not be an empty list'

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
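
The two debugging aids named at the end of the error message can be sketched in Python as follows. This is a minimal sketch, not a fix for the error: the command is copied from the reproduction step, the environment variables and the `suppress_errors` flag are the ones quoted in the message, and the helper names (`TRAIN_CMD`, `verbose_dynamo_env`, `enable_eager_fallback`) are hypothetical.

```python
import os

# Command taken from the reproduction step of this report.
TRAIN_CMD = [
    "python", "tools/train.py",
    "configs/rtmdet/rtmdet_s_8xb32-300e_coco.py",
    "--cfg-options", "compile=True",
]


def verbose_dynamo_env(base_env=None):
    """Return a copy of the environment with the two variables the message names."""
    env = dict(os.environ if base_env is None else base_env)
    env["TORCH_LOGS"] = "+dynamo"
    env["TORCHDYNAMO_VERBOSE"] = "1"
    return env


def enable_eager_fallback():
    """Apply the fallback quoted in the error message (requires PyTorch)."""
    # Imported lazily so the helpers above also work where PyTorch is absent.
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
```

For a verbose rerun, something like `subprocess.run(TRAIN_CMD, env=verbose_dynamo_env())` should work from the repository root. `enable_eager_fallback()` has to run in the training process before the first compiled call. With that flag set, frames that fail to compile run in eager mode, so it hides the failure rather than resolving it.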


matthost commented 8 months ago

Same issue, did you figure it out?

DonggeunYu commented 8 months ago

> Same issue, did you figure it out?

No... Sorry..

matthost commented 3 months ago

Same issue in PyTorch 2.3. It also occurs when trying to explain model compilation with something like dynamo.explain(inference_detector, model, image).
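
For reference, a hedged sketch of the call form that, as far as I know, PyTorch 2.1 and later expect for `torch._dynamo.explain`: the function is wrapped first and the arguments are passed to the returned callable. The wrapper name `explain_call` and its `explain` parameter are hypothetical and exist only so the call shape can be exercised without PyTorch installed. Whether this form avoids the error reported here is not known.

```python
def explain_call(fn, *args, explain=None, **kwargs):
    """Call `explain(fn)(*args, **kwargs)`, defaulting to torch._dynamo.explain."""
    if explain is None:
        # Imported lazily; assumes PyTorch 2.1 or later is installed.
        import torch._dynamo as dynamo
        explain = dynamo.explain
    # Curried form: wrap the function first, then pass its arguments.
    return explain(fn)(*args, **kwargs)
```

With MMDetection this would be called as `explain_call(inference_detector, model, image)`, and printing the returned object should show the graph breaks that Dynamo found.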