Closed caolonghao closed 1 month ago
Hi @caolonghao,
sh scripts/inference.sh data/checkpoint/aios_checkpoint.pth short_video.mp4 demo 2 0.1 1
I will update this part to support more ways to run the code.
Still getting an error there. Maybe you could package a Colab demo so that it can be reproduced more easily.
Traceback (most recent call last):
File "main.py", line 389, in <module>
main(args)
File "main.py", line 291, in main
inference(model,
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/engine.py", line 338, in inference
outputs, targets, data_batch_nc = model(data_batch)
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/aios_smplx.py", line 1001, in forward
hs, reference, hs_enc, ref_enc, init_box_proposal = self.transformer(
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/transformer.py", line 332, in forward
memory, enc_intermediate_output, enc_intermediate_refpoints = self.encoder(
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/transformer.py", line 642, in forward
output = layer(src=output,
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/transformer_deformable.py", line 62, in forward
src2 = self.self_attn(self.with_pos_embed(src, pos), reference_points,
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/ops/modules/ms_deform_attn.py", line 96, in forward
value = self.value_proj(input_flatten)
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/functional.py", line 1848, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
rank0: Traceback (most recent call last):
rank0: File "main.py", line 395, in <module>
rank0: File "main.py", line 297, in main
rank0: File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0: return func(*args, **kwargs)
rank0: File "/home/tony/mycode/motion_estimation/AiOS/engine.py", line 338, in inference
rank0: outputs, targets, data_batch_nc = model(data_batch)
rank0: File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0: return self._call_impl(*args, **kwargs)
rank0: File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0: return forward_call(*args, **kwargs)
rank0: File "/home/tony/mycode/motion_estimation/AiOS/models/aios/aios_smplx.py", line 962, in forward
rank0: samples, targets = self.prepare_targets(data_batch)
rank0: File "/home/tony/mycode/motion_estimation/AiOS/models/aios/aios_smplx.py", line 1845, in prepare_targets
rank0: data_batch_coco = []
the same problem..........
@caolonghao you can still use the default mmcv distributed in the code but with modifications from https://github.com/open-mmlab/mmdetection/issues/10720 and https://github.com/HarborYuan/mmcv_16/commit/ad1a72fe0cbeead2716706ff618dfa0269d2cf4c. Then you should be good to go.
Thanks, this solved my problem. I installed PyTorch 2.4.1 with CUDA 12.1 and pytorch3d from the conda file. After that, I modified mmcv as you described, and the code now runs as expected.
Thanks for your fantastic work, but I encountered a series of problems when running the demo. I would really appreciate it if you could give me some help. Here are the problems I got:
Environment Error
If I follow the instructions in the README to install pytorch 1.10.1 and then pytorch3d, I get a CUDA version mismatch error:
The detected CUDA version (12.4) mismatches the version that was used to compile PyTorch (11.3). Please make sure to use the same CUDA versions.
I solved this by installing the latest pytorch 2.3.1 and manually downloading and installing the pytorch3d conda package. I don't know whether I should install an older version of the Nvidia driver on my machine.
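For anyone hitting the same mismatch, a quick way to see both numbers side by side is sketched below. It assumes the "detected" version in the message comes from the local CUDA toolkit (`nvcc`) and the "compiled" version is what `torch.version.cuda` reports. The helper names are made up for this example and are not part of AiOS or PyTorch.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

Version = Tuple[int, int]


def parse_nvcc_release(nvcc_output: str) -> Optional[Version]:
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def parse_cuda_version(version: Optional[str]) -> Optional[Version]:
    """Turn a string such as '11.3' into (11, 3); None stays None."""
    if not version:
        return None
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_match(toolkit: Optional[Version], torch_cuda: Optional[Version]) -> bool:
    """True only when both versions are known and identical."""
    return toolkit is not None and toolkit == torch_cuda


def report() -> None:
    """Print the toolkit and PyTorch CUDA versions (needs torch installed)."""
    import torch  # imported here so the helpers above work without torch

    nvcc = shutil.which("nvcc")
    output = ""
    if nvcc is not None:
        output = subprocess.run(
            [nvcc, "--version"], capture_output=True, text=True
        ).stdout
    toolkit = parse_nvcc_release(output)
    torch_cuda = parse_cuda_version(torch.version.cuda)
    print("toolkit (nvcc):", toolkit)
    print("torch built with:", torch_cuda)
    print("match:", versions_match(toolkit, torch_cuda))
```

Calling `report()` inside the conda environment prints both versions. If they differ, the usual options are to install a PyTorch build that matches the toolkit or to point the build at a matching toolkit.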
debugpy always waiting
If I don't comment out the line
debugpy.wait_for_client()
the code just stops there and waits forever for a debugpy client to attach.
Some distributed running error
If I use the default mmcv distributed in the code, I get the following error, which seems to be a bug related to device type:
If I disable distributed running, another error shows up, which also seems to be related to data type conversion:
My environment for running the code is:
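Regarding the debugpy hang described above, one possible workaround that avoids commenting the line out by hand is to make the wait opt-in. This is only a sketch: the environment variable `AIOS_DEBUG`, the function name, and the port 5678 are assumptions for illustration, and I don't know how the repository actually sets up debugpy.

```python
import os
from typing import Mapping


def maybe_wait_for_debugger(
    env: Mapping[str, str] = os.environ, port: int = 5678
) -> bool:
    """Block for a debugger only when AIOS_DEBUG=1 (hypothetical variable).

    Returns True if we waited for a client, False if debugging was skipped.
    """
    if env.get("AIOS_DEBUG") != "1":
        return False  # normal run: never touch debugpy

    import debugpy  # lazy import so normal runs do not need debugpy installed

    debugpy.listen(port)       # open the debug adapter port
    debugpy.wait_for_client()  # blocks until a debugger attaches
    return True
```

With a guard like this, a plain run starts immediately, and setting `AIOS_DEBUG=1` before launching the script restores the wait-for-debugger behaviour.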