Closed jeejeelee closed 6 months ago
The error occurred while executing this code. I have tested this on different devices, and all encountered the same error.
@jeejeelee Could you put up the whole error log?
Thanks, the whole error log is as follows:
[1] 2297008 bus error (core dumped) pytest test_model_runner.py
@jeejeelee 🥲 Does running the vllm engine itself work for you? Is that error only related to test_model_runner?
Nope, actually, I first encountered this error while running offline_inference.py, and then encountered the same error when running pytest test_model_runner.py.
By the way, I'm curious whether you have encountered this error when building from source.
It's possible it's an issue with my environment, but I've tested it on both RTX3090 and A800, and the error occurred in both cases
I also ran the following command:
ldd libnccl.so.2.18.1
And encountered an error in the result:
ldd: exited with unknown exit code (135)
Thanks for your response
After reinstalling NCCL from https://github.com/NVIDIA/nccl, I successfully resolved this error
This usually means the library is corrupted. Glad that it is resolved by reinstalling nccl 👍
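For context on the exit code: shells conventionally report 128 + N when a process is killed by signal N, so 135 corresponds to signal 7, which is SIGBUS on Linux x86_64 and matches the "bus error" in the log above. A bus error is a plausible symptom when the loader maps a file that is shorter than its headers claim, though that is an inference rather than something confirmed in this thread. A minimal sketch of the decoding, where describe_exit_code is a hypothetical helper name and not part of vLLM or NCCL:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status (128 + N means killed by signal N)."""
    if code > 128:
        signum = code - 128
        try:
            # Signal numbering is platform specific; 7 is SIGBUS on Linux x86_64.
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return f"exited with status {code}"


# The status reported by ldd in this thread.
print(describe_exit_code(135))
```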
@youkaichao Thanks for your response.
Although reinstalling solved the error, I am curious where the libnccl.so.2.18. in .config/vllm/nccl/cu12 came from. It seems to be this libnccl.so.2.18 that caused the error, so replacing it after reinstalling nccl fixed the problem.
It is downloaded from https://developer.download.nvidia.com/compute/redist/nccl/ .
Can you give more details of your environment? I need to check what caused the problem. The downloaded nccl should work for x86_64 in general.
You can report your environment by executing https://github.com/vllm-project/vllm/blob/main/collect_env.py .
@youkaichao
As mentioned earlier, the environment information I provided was generated using collect_env.py. I am glad to help you analyze this problem. If you need any additional information from me, feel free to reach out.
Uh, I see. Can you please download and unzip this wheel https://pypi.org/project/nvidia-nccl-cu12/2.18.3/#files, and see if ldd for the .so file inside also core dumps?
@youkaichao
$ unzip nvidia_nccl_cu12-2.18.3-py3-none-manylinux1_x86_64.whl
Archive: nvidia_nccl_cu12-2.18.3-py3-none-manylinux1_x86_64.whl
inflating: nvidia/__init__.py
inflating: nvidia/nccl/__init__.py
inflating: nvidia/nccl/include/__init__.py
inflating: nvidia/nccl/include/nccl.h
inflating: nvidia/nccl/include/nccl_net.h
inflating: nvidia/nccl/lib/__init__.py
inflating: nvidia/nccl/lib/libnccl.so.2
inflating: nvidia_nccl_cu12-2.18.3.dist-info/License.txt
inflating: nvidia_nccl_cu12-2.18.3.dist-info/METADATA
inflating: nvidia_nccl_cu12-2.18.3.dist-info/WHEEL
inflating: nvidia_nccl_cu12-2.18.3.dist-info/top_level.txt
inflating: nvidia_nccl_cu12-2.18.3.dist-info/RECORD
and then:
ldd libnccl.so.2
linux-vdso.so.1 (0x00007ffffb1bc000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f16d3bbd000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f16d3bb3000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f16d3bad000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f16d39cb000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f16d387c000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f16d3861000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f16d366d000)
/lib64/ld-linux-x86-64.so.2 (0x00007f16e4853000)
The ldd result looks correct. Actually, I built from source with the following command:
VLLM_INSTALL_PUNICA_KERNELS=1 pip install -e . -i https://pypi.tuna.tsinghua.edu.cn/simple
The installed vllm-nccl-cu12 is:
vllm-nccl-cu12 2.18.1.0.1.0
The libnccl.so from the link you provided is 277MB, whereas the one I downloaded is only 45MB.
Could the Tsinghua mirror be the reason?
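The size comparison above can also be scripted. The following is a rough sketch, assuming the wheel is available as a local file; shared_libs_in_wheel is a hypothetical helper name. It uses Python's standard zipfile module to run a CRC check and list the uncompressed size of each shared library in the archive. A truncated wheel will often fail to open at all with zipfile.BadZipFile, because the ZIP central directory sits at the end of the file.

```python
import zipfile


def shared_libs_in_wheel(wheel_path):
    """Return (member_name, uncompressed_size_bytes) for each .so file in a wheel."""
    with zipfile.ZipFile(wheel_path) as wheel:
        # testzip() returns the name of the first member with a bad CRC, or None.
        bad_member = wheel.testzip()
        if bad_member is not None:
            raise ValueError(f"corrupt member in wheel: {bad_member}")
        return [
            (info.filename, info.file_size)
            for info in wheel.infolist()
            # Match versioned names such as libnccl.so.2 as well as plain .so files.
            if ".so" in info.filename.rsplit("/", 1)[-1]
        ]


# Example (path is illustrative):
# for name, size in shared_libs_in_wheel("nvidia_nccl_cu12-2.18.3-py3-none-manylinux1_x86_64.whl"):
#     print(f"{name}: {size / 1e6:.1f} MB")
```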
Which one is 45MB? I checked https://pypi.tuna.tsinghua.edu.cn/simple/nvidia-nccl-cu12/, and the file there is over 200MB.
You can see the installation script at https://github.com/vllm-project/vllm-nccl/blob/main/setup.py. Can you try downloading the file yourself? It might be a problem with your installation, e.g. an incomplete download.
@youkaichao
The nccl.so installed using the Tsinghua mirror only occupies 45MB.
The problem was indeed caused by an incomplete download, which in turn was related to my network. After reconfiguring my network settings and reinstalling vllm, ldd now reports nccl.so correctly, and I no longer encounter the core dump mentioned earlier.
Thank you once again for your helpful feedback.
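For anyone hitting the same symptom, one way to spot an incomplete download before loading the library is to record the file's size and hash and compare them with a known-good copy. This is only a sketch: summarize_library is a hypothetical helper, and the reference size or SHA-256 has to come from a trusted copy of the file, since none is given in this thread. Note that a truncated file usually still begins with the ELF magic bytes, so the size and hash are the meaningful checks.

```python
import hashlib
from pathlib import Path

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF file


def summarize_library(path):
    """Return size, ELF-magic check, and SHA-256 digest for a shared library file."""
    lib_path = Path(path)
    digest = hashlib.sha256()
    with lib_path.open("rb") as f:
        header = f.read(4)
        digest.update(header)
        # Hash the rest in 1 MiB chunks to avoid loading a large .so into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "size_bytes": lib_path.stat().st_size,
        "has_elf_magic": header == ELF_MAGIC,
        "sha256": digest.hexdigest(),
    }


# Example (path is illustrative): compare the result with a known-good copy.
# print(summarize_library(Path.home() / ".config/vllm/nccl/cu12" / "libnccl.so.2.18.1"))
```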
How do I fix this error? Should I change to the aliyun mirror?
Reinstalling NCCL from https://github.com/NVIDIA/nccl
Your current environment
🐛 Describe the bug
I built vllm from source and encountered the following error:
Minimal code to reproduce the error: