hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All
https://hpcaitech.github.io/Open-Sora/
Apache License 2.0
21.76k stars, 2.11k forks

undefined symbol: nvmlDeviceGetNvLinkRemoteDeviceType #460

Closed: Zane0227 closed this issue 3 months ago

Zane0227 commented 3 months ago

After pulling the latest code (v1.2), inference works normally, but fine-tuning fails with the error undefined symbol: nvmlDeviceGetNvLinkRemoteDeviceType.

[rank0]:   File "/data/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/miniconda3/envs/opensora/lib/python3.9/site-packages/opensora/models/stdit/stdit3.py", line 136, in forward
[rank0]:     x = x + self.crossattn(x, y, mask)
[rank0]: RuntimeError: r.nvmlDeviceGetNvLinkRemoteDeviceType INTERNAL ASSERT FAILED at "../c10/cuda/driver_api.cpp":27, please report a bug to PyTorch. Can't find nvmlDeviceGetNvLinkRemoteDeviceType: /lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetNvLinkRemoteDeviceType

Zane0227 commented 3 months ago

CUDA version: 12.1

zhengzangw commented 3 months ago

I did not run into this problem. Since the error is raised in cross_attn, my guess is that the installed xformers version is the wrong one. @ver217, could you have a look?

Zane0227 commented 3 months ago

Thanks for the reply. My xformers version is 0.0.26.post1. Which version should I use?
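For anyone comparing environments, a minimal sketch for listing the installed torch and xformers versions. The helper name `installed_versions` is hypothetical (not part of Open-Sora), and it only reads package metadata, so it works without a GPU:

```python
from importlib import metadata


def installed_versions(packages=("torch", "xformers")):
    """Return {package: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is absent from this environment
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Posting this output together with the driver version reported by nvidia-smi makes it easier to spot a mismatch.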

Zane0227 commented 3 months ago

Problem solved by using torch 2.1.0 with xformers 0.0.22.post4. Driver version 450.80.02 is too old for torch 2.3.0.
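The traceback says that /lib64/libnvidia-ml.so.1, which is installed by the driver, does not export nvmlDeviceGetNvLinkRemoteDeviceType. A rough way to check this on a given machine is sketched below. The helper name `library_exports` is hypothetical, and the check only tests whether the symbol can be resolved; it does not confirm what torch does with it:

```python
import ctypes

SYMBOL = "nvmlDeviceGetNvLinkRemoteDeviceType"


def library_exports(symbol, library="libnvidia-ml.so.1"):
    """Return True/False if the library loads, or None if it cannot be loaded."""
    try:
        lib = ctypes.CDLL(library)
    except OSError:
        return None  # library not found, e.g. no NVIDIA driver on this machine
    # ctypes looks the symbol up on attribute access; a missing symbol
    # raises AttributeError, which hasattr() turns into False.
    return hasattr(lib, symbol)


if __name__ == "__main__":
    result = library_exports(SYMBOL)
    if result is None:
        print("libnvidia-ml.so.1 could not be loaded")
    else:
        print(f"{SYMBOL} exported: {result}")
```

If this prints False, the installed driver's NVML library lacks the symbol, which matches the error above.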

SiyangJ commented 1 week ago

Hi, I recently ran into the same problem. I tend to agree that the old driver version is the cause, but I can't upgrade the driver in my case. Have you tried any other fixes? Do you think there is another way to work around it?