ttxskk / AiOS

[CVPR 2024] Official Code for "AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation"
https://ttxskk.github.io/AiOS/

Can AiOS model support mmcv>=2.0.0? #24

Closed formoree closed 4 weeks ago

formoree commented 4 weeks ago

Because the CUDA version on my Linux machine is 12.4, I cannot install PyTorch 1.x, which rules out MMCV 1.x. As you know, MMCV has very strict version requirements. I therefore tried MMCV>=2.0.0, but I'm not sure whether it is compatible. So far it is not working: every attempt ends in version-related errors.

ttxskk commented 4 weeks ago

Hi @formoree, I apologize for the inconvenience. Since I primarily use PyTorch 1.x, I haven't run into this issue before. I tried installing PyTorch 2.1 with the following command:

conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia

I also modified the source code of MMCV following this link, and it worked well. You can try it.
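
Since the errors in this thread all come down to which versions ended up in the environment, it can help to print them before running `main.py`. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `major_version` are made up for illustration, and the distribution names checked (`mmcv`, `mmcv-full`) are assumptions about how MMCV was installed. A source checkout that was never installed as a package will show up as "not installed".

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '2.1.2+cu121'."""
    return int(version.split(".")[0])


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to your environment.
    for name in ("torch", "torchvision", "mmcv", "mmcv-full"):
        version = installed_version(name)
        print(f"{name}: {version or 'not installed'}")
```

If `torch` reports a 2.x version while MMCV reports 1.x, you are in the mixed setup discussed here, where MMCV's source may need manual changes.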

I've reviewed the code and noticed that some dependencies, such as MMCV, could be removed to make the code compatible with PyTorch 2.x. However, I don't have enough time to implement it, so I can't say when I’ll be able to release it.

Looking forward to your feedback.

formoree commented 4 weeks ago

Thank you very much for your reply. According to your instructions, I have encountered a brand new problem.


Traceback (most recent call last):
  File "main.py", line 390, in <module>
    main(args)
  File "main.py", line 151, in main
    model, criterion, postprocessors, _ = build_model_main(
  File "main.py", line 82, in build_model_main
    from models.registry import MODULE_BUILD_FUNCS
  File "/home/user/Documents/AiOS/models/__init__.py", line 1, in <module>
    from .aios import build_aios_smplx
  File "/home/user/Documents/AiOS/models/aios/__init__.py", line 1, in <module>
    from .aios_smplx import build_aios_smplx
  File "/home/user/Documents/AiOS/models/aios/aios_smplx.py", line 15, in <module>
    from .transformer import build_transformer
  File "/home/user/Documents/AiOS/models/aios/transformer.py", line 10, in <module>
    from .transformer_deformable import DeformableTransformerEncoderLayer, DeformableTransformerDecoderLayer
  File "/home/user/Documents/AiOS/models/aios/transformer_deformable.py", line 11, in <module>
    from .ops.modules import MSDeformAttn
  File "/home/user/Documents/AiOS/models/aios/ops/modules/__init__.py", line 9, in <module>
    from .ms_deform_attn import MSDeformAttn
  File "/home/user/Documents/AiOS/models/aios/ops/modules/ms_deform_attn.py", line 21, in <module>
    from ..functions import MSDeformAttnFunction
  File "/home/user/Documents/AiOS/models/aios/ops/functions/__init__.py", line 9, in <module>
    from .ms_deform_attn_func import MSDeformAttnFunction
  File "/home/user/Documents/AiOS/models/aios/ops/functions/ms_deform_attn_func.py", line 18, in <module>
    import MultiScaleDeformableAttention as MSDA
ImportError: /home/user/miniconda3/envs/aios/lib/python3.8/site-packages/MultiScaleDeformableAttention-1.0-py3.8-linux-x86_64.egg/MultiScaleDeformableAttention.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops10zeros_like4callERKNS_6TensorESt8optionalIN3c1010ScalarTypeEES5_INS6_6LayoutEES5_INS6_6DeviceEES5_IbES5_INS6_12MemoryFormatEE
[2024-09-27 19:16:31,663] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1122816) of binary: /home/user/miniconda3/envs/aios/bin/python
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/aios/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/miniconda3/envs/aios/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-27_19:16:31
  host      : user-A6000
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1122816)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
formoree commented 4 weeks ago

import MultiScaleDeformableAttention as MSDA

I solved this problem by manually installing MSDA. https://blog.csdn.net/feeling0414/article/details/135081023

formoree commented 4 weeks ago

If this issue occurs, you can refer to this issue; it requires manually modifying the built-in code. https://github.com/open-mmlab/mmaction2/issues/1536

formoree commented 4 weeks ago

And then you may meet the same problem mentioned above.

File "main.py", line 390, in <module>
    main(args)
  File "main.py", line 292, in main
    inference(model,
  File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/Documents/AiOS/engine.py", line 338, in inference
    outputs, targets, data_batch_nc = model(data_batch)
  File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/distributed.py", line 165, in _run_ddp_forward
    inputs, kwargs = self.to_kwargs(  # type: ignore
  File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/distributed.py", line 27, in to_kwargs
    return scatter_kwargs(inputs, kwargs, [device_id], dim=self.dim)
  File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/scatter_gather.py", line 60, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
  File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/scatter_gather.py", line 50, in scatter
    return scatter_map(inputs)
  File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/scatter_gather.py", line 35, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/scatter_gather.py", line 40, in scatter_map
    out = list(map(type(obj), zip(*map(scatter_map, obj.items()))))
  File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/scatter_gather.py", line 35, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/scatter_gather.py", line 33, in scatter_map
    return Scatter.forward(target_gpus, obj.data)
  File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/_functions.py", line 75, in forward
    streams = [_get_stream(device) for device in target_gpus]
  File "/home/user/Documents/AiOS/mmcv/mmcv/parallel/_functions.py", line 75, in <listcomp>
    streams = [_get_stream(device) for device in target_gpus]
  File "/home/user/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 117, in _get_stream
    if device.type == "cpu":
AttributeError: 'int' object has no attribute 'type'
ttxskk commented 4 weeks ago

Hi @formoree, did you solve it? You can refer to (https://github.com/ttxskk/AiOS/issues/16#issuecomment-2310730167)
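
For readers hitting the same `AttributeError`: the traceback shows MMCV's `_functions.py` passing each entry of `target_gpus` (a plain integer) to PyTorch's `_get_stream`, which then reads `device.type`. The sketch below reproduces that failure pattern and one way around it without requiring PyTorch. `Device`, `get_stream`, and `as_device` are stand-ins invented for this illustration. In an actual patch, the conversion would presumably be something like `torch.device("cuda", device)` applied in `mmcv/parallel/_functions.py`, but that is an assumption and not a fix confirmed in this thread.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Device:
    """Stand-in for torch.device so the sketch runs without PyTorch."""
    type: str
    index: int = 0


def get_stream(device: Device) -> str:
    """Mimics the attribute access seen in the traceback: it reads device.type."""
    if device.type == "cpu":
        return "no-stream"
    return f"stream-for-{device.type}:{device.index}"


def as_device(device: Union[int, Device]) -> Device:
    """Wrap a bare GPU index in a device object; pass device objects through."""
    if isinstance(device, int):
        return Device("cuda", device)
    return device


target_gpus = [0]

# Passing the bare index fails the same way as in the traceback.
try:
    get_stream(target_gpus[0])
except AttributeError as exc:
    error_message = str(exc)

# Converting each index to a device object first avoids the error.
streams = [get_stream(as_device(d)) for d in target_gpus]
```

The point of the pattern is that the caller normalizes its input once, so code that expects a device object never sees a raw integer.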

ttxskk commented 4 weeks ago

To resolve this error, follow our documentation and rebuild the ops with the following commands:

# Build deformable detr
cd models/aios/ops
python setup.py build install
cd ../../..
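
An `undefined symbol` error on import usually means the compiled extension was built against a different PyTorch build than the one currently installed, which is why rebuilding helps. After rebuilding, a quick import check can confirm the result before launching a full run. This is a rough sketch: `check_import` is a hypothetical helper, and preloading `torch` is an assumption based on the repository importing it before the extension.

```python
import importlib
from typing import Iterable, Tuple


def check_import(module_name: str, preload: Iterable[str] = ()) -> Tuple[bool, str]:
    """Import each module in preload, then module_name, and report the outcome."""
    try:
        for name in preload:
            importlib.import_module(name)
        importlib.import_module(module_name)
    except ImportError as exc:
        # Undefined-symbol failures in compiled extensions surface as ImportError.
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"


if __name__ == "__main__":
    ok, detail = check_import("MultiScaleDeformableAttention", preload=("torch",))
    print("import succeeded" if ok else f"import failed: {detail}")
```

If the check still fails after a rebuild, a stale copy of the extension from an earlier build may still be installed in `site-packages`.
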
formoree commented 4 weeks ago

Thank you very much for your response; I had already resolved the issue by then!