Closed formoree closed 4 weeks ago
Hi @formoree, I apologize for the inconvenience. Since I primarily use PyTorch 1.x, I haven't run into this issue before. I tried installing PyTorch 2.1 with the following command:
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
I also modified the source code of MMCV following this link, and it worked well. You can try it.
I've reviewed the code and noticed that some dependencies, such as MMCV, could be removed to make the code compatible with PyTorch 2.x. However, I don't have enough time to implement it, so I can't say when I’ll be able to release it.
Looking forward to your feedback.
Thank you very much for your reply. After following your instructions, I ran into a new problem.
Traceback (most recent call last):
File "main.py", line 390, in <module>
main(args)
File "main.py", line 151, in main
model, criterion, postprocessors, _ = build_model_main(
File "main.py", line 82, in build_model_main
from models.registry import MODULE_BUILD_FUNCS
File "/home/user/Documents/AiOS/models/__init__.py", line 1, in <module>
from .aios import build_aios_smplx
File "/home/user/Documents/AiOS/models/aios/__init__.py", line 1, in <module>
from .aios_smplx import build_aios_smplx
File "/home/user/Documents/AiOS/models/aios/aios_smplx.py", line 15, in <module>
from .transformer import build_transformer
File "/home/user/Documents/AiOS/models/aios/transformer.py", line 10, in <module>
from .transformer_deformable import DeformableTransformerEncoderLayer, DeformableTransformerDecoderLayer
File "/home/user/Documents/AiOS/models/aios/transformer_deformable.py", line 11, in <module>
from .ops.modules import MSDeformAttn
File "/home/user/Documents/AiOS/models/aios/ops/modules/__init__.py", line 9, in <module>
from .ms_deform_attn import MSDeformAttn
File "/home/user/Documents/AiOS/models/aios/ops/modules/ms_deform_attn.py", line 21, in <module>
from ..functions import MSDeformAttnFunction
File "/home/user/Documents/AiOS/models/aios/ops/functions/__init__.py", line 9, in <module>
from .ms_deform_attn_func import MSDeformAttnFunction
File "/home/user/Documents/AiOS/models/aios/ops/functions/ms_deform_attn_func.py", line 18, in <module>
import MultiScaleDeformableAttention as MSDA
ImportError: /home/user/miniconda3/envs/aios/lib/python3.8/site-packages/MultiScaleDeformableAttention-1.0-py3.8-linux-x86_64.egg/MultiScaleDeformableAttention.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops10zeros_like4callERKNS_6TensorESt8optionalIN3c1010ScalarTypeEES5_INS6_6LayoutEES5_INS6_6DeviceEES5_IbES5_INS6_12MemoryFormatEE
[2024-09-27 19:16:31,663] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1122816) of binary: /home/user/miniconda3/envs/aios/bin/python
Traceback (most recent call last):
File "/home/user/miniconda3/envs/aios/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/user/miniconda3/envs/aios/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-27_19:16:31
host : user-A6000
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1122816)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
import MultiScaleDeformableAttention as MSDA
I solved this problem by manually installing MSDA, following https://blog.csdn.net/feeling0414/article/details/135081023
If this issue occurs, you can refer to https://github.com/open-mmlab/mmaction2/issues/1536; the fix requires manually modifying the built-in code.
You may then run into the following problem:
File "main.py", line 390, in <module>
main(args)
File "main.py", line 292, in main
inference(model,
File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/Documents/AiOS/engine.py", line 338, in inference
outputs, targets, data_batch_nc = model(data_batch)
File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/distributed.py", line 165, in _run_ddp_forward
inputs, kwargs = self.to_kwargs( # type: ignore
File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/distributed.py", line 27, in to_kwargs
return scatter_kwargs(inputs, kwargs, [device_id], dim=self.dim)
File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/scatter_gather.py", line 60, in scatter_kwargs
inputs = scatter(inputs, target_gpus, dim) if inputs else []
File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/scatter_gather.py", line 50, in scatter
return scatter_map(inputs)
File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/scatter_gather.py", line 35, in scatter_map
return list(zip(*map(scatter_map, obj)))
File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/scatter_gather.py", line 40, in scatter_map
out = list(map(type(obj), zip(*map(scatter_map, obj.items()))))
File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/scatter_gather.py", line 35, in scatter_map
return list(zip(*map(scatter_map, obj)))
File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/scatter_gather.py", line 33, in scatter_map
return Scatter.forward(target_gpus, obj.data)
File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/_functions.py", line 75, in forward
streams = [_get_stream(device) for device in target_gpus]
File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/_functions.py", line 75, in <listcomp>
streams = [_get_stream(device) for device in target_gpus]
File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 117, in _get_stream
if device.type == "cpu":
AttributeError: 'int' object has no attribute 'type'
Hi @formoree, did you solve it? You can refer to this comment: https://github.com/ttxskk/AiOS/issues/16#issuecomment-2310730167
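The AttributeError traceback above ends in torch's `_get_stream` reading `device.type`, while MMCV's `Scatter.forward` passes plain integer GPU ids. A minimal sketch of one possible workaround, assuming the only mismatch is int versus device object, is to convert integers before the call. The names `int_safe`, `FakeDevice`, and `fake_get_stream` are hypothetical stand-ins so the sketch runs without PyTorch; this is not necessarily what the linked comment does.

```python
from typing import Any, Callable


def int_safe(get_stream: Callable[[Any], Any],
             make_device: Callable[[int], Any]) -> Callable[[Any], Any]:
    """Wrap get_stream so integer GPU ids are converted to device objects first."""
    def wrapper(device: Any) -> Any:
        if isinstance(device, int):
            device = make_device(device)
        return get_stream(device)
    return wrapper


# Hypothetical stand-in for torch.device, used only for illustration.
class FakeDevice:
    def __init__(self, kind: str, index: int) -> None:
        self.type = kind
        self.index = index


def fake_get_stream(device: FakeDevice):
    # Mirrors the failing check in the traceback: it reads device.type.
    if device.type == "cpu":
        return None
    return f"stream-for-{device.type}:{device.index}"


get_stream = int_safe(fake_get_stream, lambda i: FakeDevice("cuda", i))
streams = [get_stream(d) for d in [0, 1]]
print(streams)
```

With PyTorch installed, the same idea would presumably be applied in mmcv/parallel/_functions.py as `int_safe(_get_stream, lambda i: torch.device("cuda", i))`, where `_get_stream` is the function from `torch.nn.parallel._functions` shown in the traceback.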
To resolve the undefined symbol ImportError for MultiScaleDeformableAttention, you can follow our docs and build the ops using the following commands:
# Build deformable detr
cd models/aios/ops
python setup.py build install
cd ../../..
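An undefined symbol error from a compiled extension usually means the .so file was built against a different PyTorch version than the one now installed, which is why rebuilding after changing PyTorch tends to help. As a rough way to confirm that the rebuild worked, the sketch below tries the import and labels the outcome. The function name `check_import` and its labels are hypothetical, not part of AiOS.

```python
import importlib
from typing import Iterable


def check_import(name: str, preload: Iterable[str] = ("torch",)) -> str:
    """Try importing `name` and classify the outcome as a short label."""
    # Import prerequisites such as torch first so their shared libraries are loaded.
    for dep in preload:
        try:
            importlib.import_module(dep)
        except ImportError:
            pass  # Not importable; let the target import report its own error.
    try:
        importlib.import_module(name)
    except ModuleNotFoundError:
        return "missing"        # the module is not installed at all
    except ImportError as exc:
        if "undefined symbol" in str(exc):
            return "abi_mismatch"  # built against another PyTorch; rebuild the ops
        return "import_error"
    return "ok"


# Demonstration on a standard-library module, so it runs anywhere.
print(check_import("json", preload=()))
```

After running the build commands, calling `check_import("MultiScaleDeformableAttention")` inside the aios environment should report "ok" if the extension now matches the installed PyTorch.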
Thank you very much for your response; I had already resolved the issue at that time!
Because the CUDA version on my Linux machine is 12.4, I cannot install PyTorch 1.x, so MMCV 1.x is not an option. As you know, MMCV has very strict version requirements. I therefore tried MMCV>=2.0.0, but I'm not sure whether it is compatible. So far it is not working: every attempt ends in version-related errors.
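When chasing version-related errors like these, it can help to post the exact installed versions alongside the traceback. The following is a small sketch using only the standard library (Python 3.8+). The package list is an assumption about what is relevant here, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return the installed version of each distribution, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed list of relevant packages; MMCV 1.x with ops is published as "mmcv-full".
report = installed_versions(["torch", "torchvision", "mmcv", "mmcv-full"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

This only reports what is installed; it does not decide whether a given PyTorch and MMCV pair is compatible.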