Error running the demo

caolonghao commented 3 months ago

Thanks for your fantastic work. I ran into a series of problems when running the demo and would really appreciate some help. Here are the problems I got:

Environment Error

If I follow the instructions in the README to install pytorch 1.10.1 and then pytorch3d, I get a CUDA version mismatch error:

The detected CUDA version (12.4) mismatches the version that was used to compile PyTorch (11.3). Please make sure to use the same CUDA versions.

I solved this by installing the latest pytorch 2.3.1, then manually downloading the pytorch3d conda package and installing it. I don't know if I should install an older version of the Nvidia driver on my machine.
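The mismatch message above compares two version strings: the CUDA toolkit found on the machine and the CUDA version PyTorch was built with. A minimal sketch of that comparison follows. The helper names `parse_major_minor` and `describe_cuda_pair` are hypothetical, and treating a different major version as the hard-failure case is an assumption, not something the error message states. If PyTorch is installed, the last lines also print `torch.version.cuda`, the version the installed build was compiled against.

```python
import re


def parse_major_minor(version):
    """Return (major, minor) from a string such as '12.4' or '11.3.1'."""
    match = re.match(r"^\s*(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized CUDA version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def describe_cuda_pair(toolkit_version, torch_build_version):
    """Classify how the local toolkit relates to PyTorch's build version.

    Assumption: a differing major version is the case that breaks
    extension builds such as pytorch3d; this is a hypothetical helper.
    """
    toolkit = parse_major_minor(toolkit_version)
    built = parse_major_minor(torch_build_version)
    if toolkit == built:
        return "match"
    if toolkit[0] != built[0]:
        return "major mismatch"
    return "minor mismatch"


if __name__ == "__main__":
    # The pair reported in the error message above.
    print(describe_cuda_pair("12.4", "11.3"))  # prints "major mismatch"
    try:
        import torch
        print("PyTorch was built with CUDA:", torch.version.cuda)
    except ImportError:
        print("PyTorch is not installed in this environment")
```

Under this reading, the driver does not need to be downgraded. What matters is that the toolkit used to compile extensions and the PyTorch build agree, which is what switching to a newer PyTorch achieved here.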

debugpy always waiting

If I don't comment out the line debugpy.wait_for_client(), the code just stops there and waits forever for a debugpy client to attach.

def main(args):

    utils.init_distributed_mode_ssc(args)
    # utils.init_distributed_mode(args)
    if args.rank == 0:
        debugpy.listen(("127.0.0.1", 10086))
        debugpy.wait_for_client()
    print('Loading config file from {}'.format(args.config_file))
    shutil.copy2(args.config_file,'config/aios_smplx.py')
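One way to keep the remote-debugging hook without blocking normal runs is to make it opt-in. The sketch below is not the repository's code: the function name `maybe_wait_for_debugger` and the environment variable `AIOS_DEBUGPY` are hypothetical. Only `debugpy.listen` and `debugpy.wait_for_client` come from the snippet above.

```python
import os


def maybe_wait_for_debugger(rank, environ=None, host="127.0.0.1", port=10086):
    """Block for a debugger only on rank 0 and only when explicitly requested.

    `AIOS_DEBUGPY` is a hypothetical opt-in flag used for illustration.
    Returns True if the process waited for a client, False otherwise.
    """
    environ = os.environ if environ is None else environ
    if rank != 0 or environ.get("AIOS_DEBUGPY") != "1":
        return False
    # Imported lazily so debugpy is only required when debugging is enabled.
    import debugpy
    debugpy.listen((host, port))
    debugpy.wait_for_client()
    return True
```

With this shape, a plain run returns immediately, while `AIOS_DEBUGPY=1` restores the original blocking behaviour on rank 0.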

Some distributed running error

If I use the default mmcv distributed wrapper in the code, I get the following error, which looks like a bug related to the device type:

[rank0]: Traceback (most recent call last):
[rank0]:   File "main.py", line 395, in <module>
[rank0]:     main(args)
[rank0]:   File "main.py", line 297, in main
[rank0]:     inference(model,
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/tony/mycode/motion_estimation/AiOS/engine.py", line 338, in inference
[rank0]:     outputs, targets, data_batch_nc = model(data_batch)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
[rank0]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/distributed.py", line 162, in _run_ddp_forward
[rank0]:     inputs, kwargs = self.to_kwargs(  # type: ignore
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/distributed.py", line 27, in to_kwargs
[rank0]:     return scatter_kwargs(inputs, kwargs, [device_id], dim=self.dim)
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 60, in scatter_kwargs
[rank0]:     inputs = scatter(inputs, target_gpus, dim) if inputs else []
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 50, in scatter
[rank0]:     return scatter_map(inputs)
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 35, in scatter_map
[rank0]:     return list(zip(*map(scatter_map, obj)))
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 40, in scatter_map
[rank0]:     out = list(map(type(obj), zip(*map(scatter_map, obj.items()))))
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 35, in scatter_map
[rank0]:     return list(zip(*map(scatter_map, obj)))
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 33, in scatter_map
[rank0]:     return Scatter.forward(target_gpus, obj.data)
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/_functions.py", line 75, in forward
[rank0]:     streams = [_get_stream(device) for device in target_gpus]
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/_functions.py", line 75, in <listcomp>
[rank0]:     streams = [_get_stream(device) for device in target_gpus]
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 119, in _get_stream
[rank0]:     if device.type == "cpu":
[rank0]: AttributeError: 'int' object has no attribute 'type'

If I disable distributed running, another error shows up, which also seems to be related to data type conversion:

[rank0]: Traceback (most recent call last):
[rank0]:   File "main.py", line 395, in <module>
[rank0]:     main(args)
[rank0]:   File "main.py", line 297, in main
[rank0]:     inference(model,
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/tony/mycode/motion_estimation/AiOS/engine.py", line 338, in inference
[rank0]:     outputs, targets, data_batch_nc = model(data_batch)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tony/mycode/motion_estimation/AiOS/models/aios/aios_smplx.py", line 962, in forward
[rank0]:     samples, targets = self.prepare_targets(data_batch)
[rank0]:   File "/home/tony/mycode/motion_estimation/AiOS/models/aios/aios_smplx.py", line 1845, in prepare_targets
[rank0]:     data_batch_coco = []
[rank0]: AttributeError: 'DataContainer' object has no attribute 'float'
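A possible reading of this second error: in the first traceback, mmcv's scatter step reads `obj.data` before the batch reaches the model, so with the distributed wrapper disabled the model may receive the `DataContainer` wrappers themselves. The sketch below shows the general idea of unwrapping such containers recursively. The `DataContainer` class here is a local stand-in, not mmcv's class, and `unwrap` is a hypothetical helper. mmcv's real scatter logic also handles stacking and device placement, which this does not.

```python
class DataContainer:
    """Local stand-in for mmcv's DataContainer: holds a payload in `.data`."""

    def __init__(self, data):
        self.data = data


def unwrap(obj):
    """Recursively replace container objects with their `.data` payload."""
    # Match on the class name so the sketch does not need to import mmcv.
    if type(obj).__name__ == "DataContainer":
        return unwrap(obj.data)
    if isinstance(obj, dict):
        return {key: unwrap(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [unwrap(value) for value in obj]
    if isinstance(obj, tuple):
        return tuple(unwrap(value) for value in obj)
    return obj


if __name__ == "__main__":
    batch = {"img": DataContainer([1.0, 2.0]), "meta": [DataContainer({"id": 3})]}
    print(unwrap(batch))
```

This only illustrates why the error appears. Keeping the mmcv wrapper, as suggested later in the thread, avoids the need for manual unwrapping.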

My environment for running the code is:

OS: Ubuntu 24.04 LTS x86_64
Kernel: 6.8.0-38-generic
CPU: 13th Gen Intel i9-13900K (32) @ 5.500GHz
GPU: NVIDIA GeForce RTX 4090
Nvidia Driver Version: 550.90.07
CUDA Version: 12.4

ttxskk commented 3 months ago

Hi @caolonghao,

I haven't tested the version you installed. Our code is compatible with most versions of PyTorch and CUDA. The main issue is with PyTorch3D, which is used for visualization, and we've only tested it with version 0.6.1. I think you can give it a try. If you can successfully install it, I don't think there will be major problems.
The debugpy.wait_for_client() line is for remote debugging and should have been removed.
How did you run the code? Could you please try running it with the following command:

sh scripts/inference.sh data/checkpoint/aios_checkpoint.pth short_video.mp4 demo 2 0.1 1

I will update this part to support more ways to run the code.

caolonghao commented 3 months ago

I still get an error. Maybe you could put together a Colab demo so that it can be reproduced more easily.

Traceback (most recent call last):
  File "main.py", line 389, in <module>
    main(args)
  File "main.py", line 291, in main
    inference(model,
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/engine.py", line 338, in inference
    outputs, targets, data_batch_nc = model(data_batch)
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/aios_smplx.py", line 1001, in forward
    hs, reference, hs_enc, ref_enc, init_box_proposal = self.transformer(
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/transformer.py", line 332, in forward
    memory, enc_intermediate_output, enc_intermediate_refpoints = self.encoder(
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/transformer.py", line 642, in forward
    output = layer(src=output,
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/transformer_deformable.py", line 62, in forward
    src2 = self.self_attn(self.with_pos_embed(src, pos), reference_points,
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/ops/modules/ms_deform_attn.py", line 96, in forward
    value = self.value_proj(input_flatten)
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

leeooo001 commented 3 months ago

[rank0]: Traceback (most recent call last):
[rank0]:   File "main.py", line 395, in <module>
[rank0]:   File "main.py", line 297, in main
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/tony/mycode/motion_estimation/AiOS/engine.py", line 338, in inference
[rank0]:     outputs, targets, data_batch_nc = model(data_batch)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tony/mycode/motion_estimation/AiOS/models/aios/aios_smplx.py", line 962, in forward
[rank0]:     samples, targets = self.prepare_targets(data_batch)
[rank0]:   File "/home/tony/mycode/motion_estimation/AiOS/models/aios/aios_smplx.py", line 1845, in prepare_targets
[rank0]:     data_batch_coco = []

I have the same problem.

MoyGcc commented 2 months ago

@caolonghao you can still use the default mmcv distributed in the code but with modifications from https://github.com/open-mmlab/mmdetection/issues/10720 and https://github.com/HarborYuan/mmcv_16/commit/ad1a72fe0cbeead2716706ff618dfa0269d2cf4c. Then you should be good to go.
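For context on why a change to mmcv is needed: the first traceback in this thread ends in `_get_stream`, which reads `device.type`, while mmcv passes plain integer GPU ids in `target_gpus`. The linked changes are not reproduced here. The sketch below only illustrates the general idea of converting integer ids into device objects before they reach such a call. `normalize_devices`, `make_cuda_device`, `get_stream_like`, and the stand-in device class are hypothetical names. When PyTorch is installed, `torch.device("cuda", index)` is used, otherwise a small stand-in with the same two attributes.

```python
from dataclasses import dataclass

try:
    import torch

    def make_cuda_device(index):
        # Constructing a torch.device does not require a GPU to be present.
        return torch.device("cuda", index)
except ImportError:
    @dataclass(frozen=True)
    class _StandInDevice:
        """Minimal stand-in exposing the attributes the traceback touches."""
        type: str
        index: int

    def make_cuda_device(index):
        return _StandInDevice("cuda", index)


def normalize_devices(target_gpus):
    """Wrap bare integer GPU ids in device objects; leave other entries alone."""
    return [make_cuda_device(d) if isinstance(d, int) else d for d in target_gpus]


def get_stream_like(device):
    """Mimics only the attribute access that fails in the traceback."""
    if device.type == "cpu":
        return None
    return (device.type, device.index)


if __name__ == "__main__":
    try:
        get_stream_like(0)  # a bare int has no `.type`
    except AttributeError as exc:
        print("without normalization:", exc)
    print("with normalization:", get_stream_like(normalize_devices([0])[0]))
```

Please refer to the linked issue and commit for the actual edits to mmcv. This sketch is only meant to show the shape of the problem.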

caolonghao commented 1 month ago

Thanks, this solved my problem. I installed pytorch 2.4.1 with CUDA 12.1 and pytorch3d from the conda file. After that, I modified mmcv as you described, and the code now runs as expected.

ttxskk / AiOS

Error running the demo #16