Open WJMacro opened 5 months ago
Can you try following the PyTorch tutorial and building a CUDA extension? https://pytorch.org/tutorials/advanced/cpp_extension.html
It seems like your environment is broken.
Your Python interpreter is here:
/home/jhwu/anaconda3/envs/myvllm/bin/python
but the build finds torch in some strange path:
/tmp/pip-build-env-wpkamc65/overlay/lib/python3.9/site-packages/torch/lib/libtorch.so
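One quick way to confirm the mismatch (a sketch; adapt it to your own env) is to ask the interpreter itself where it and torch live. If the two paths are under different prefixes, the build is not seeing the torch you installed:

```shell
# Print the active interpreter and, if torch is importable, its location.
python - <<'EOF'
import importlib.util
import sys

print("interpreter:", sys.executable)
spec = importlib.util.find_spec("torch")
print("torch:", spec.origin if spec else "not installed")
EOF
```

Both paths should point into the same conda env (e.g. .../envs/myvllm/...), not into /tmp/pip-build-env-*.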
Sorry, I'm not familiar with CUDA and don't know how to compile /csrc/moe/topk_softmax_kernels.cu.o.
Initially, I thought the strange torch path was caused by not installing torch before building vllm, or by torch being installed with conda rather than pip. Therefore, I installed torch==2.2.1+cu121 with pip in advance, following PyTorch's official guidance. However, it seems the script still finds torch in a pip-build-env.
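The /tmp/pip-build-env-* path suggests pip's build isolation is at work: by default, pip builds the wheel in a throwaway environment containing its own freshly downloaded torch, ignoring the one installed in your env. A sketch of a workaround (assuming torch and the other build dependencies are already installed in the active env) is to disable isolation so the build links against your torch:

```shell
# Install the torch you want first (cu121 wheel from PyTorch's index)...
pip install torch==2.2.1+cu121 --index-url https://download.pytorch.org/whl/cu121

# ...then build vllm against it, skipping the throwaway /tmp build env.
pip install --no-build-isolation vllm
```

With --no-build-isolation you are responsible for having the build requirements (cmake, ninja, setuptools, etc.) installed in the env yourself.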
Same problem here.
The torch path CMake found is weird (Found Torch: /tmp/pip-build-env-6lfm8tt6/overlay/lib/python3.9/site-packages/torch/lib/libtorch.so), not the torch path inside my conda env.
BTW, the error when building against this weird torch is caused by an incompatibility between CUDA 12.1 and pybind11:
```
/tmp/pip-build-env-wpkamc65/overlay/lib/python3.9/site-packages/torch/include/pybind11/cast.h:45:120: error: expected template-name before ‘<’ token
   45 | return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                        ^
/tmp/pip-build-env-wpkamc65/overlay/lib/python3.9/site-packages/torch/include/pybind11/cast.h:45:120: error: expected identifier before ‘<’ token
/tmp/pip-build-env-wpkamc65/overlay/lib/python3.9/site-packages/torch/include/pybind11/cast.h:45:123: error: expected primary-expression before ‘>’ token
   45 | return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                        ^
/tmp/pip-build-env-wpkamc65/overlay/lib/python3.9/site-packages/torch/include/pybind11/cast.h:45:126: error: expected primary-expression before ‘)’ token
   45 | return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                           ^
ninja: build stopped: subcommand failed.
```
moe seems to be the culprit
Same here, occurring with the latest commits. Ubuntu 22.04.4, Python 3.10, CUDA 12.2, NVIDIA-SMI 535.171.04, torch-2.2.1, gcc 12.3.0
I installed torch and the CUDA toolkit a few days ago against a previous vLLM commit, which worked. "pip install vllm" works too. Installing from the latest commit fails:
```
pip install git+https://github.com/vllm-project/vllm.git@96e90fdeb3c4ebacfe24513556afccb918722b7c
```
which fails with:
```
File "/home/antti/anaconda3/envs/xx/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', '_moe_C', '-j', '32']' returned non-zero exit status 1
```
I am getting the exact same error, and I have been able to build previous versions of vLLM using my current configuration.
same error, previous version of vllm is fine
yup same issue here
the same problem
I have a similar issue using gcc 13.2 (manually compiled, as Fedora 40 only ships gcc 14) and CUDA 12.4. When building vLLM, one of the headers in PyTorch (./ATen/core/boxing/impl/boxing.h) caused template errors much like the one in this issue. I built a workaround by downloading the fixed file from the PyTorch Git repository (wget https://raw.githubusercontent.com/pytorch/pytorch/main/aten/src/ATen/core/boxing/impl/boxing.h -O boxing.h) and then adding this to CMakeLists.txt after find_package(Torch REQUIRED):

```cmake
list(GET TORCH_INCLUDE_DIRS 0 BASE_INCLUDE)
message("PATCHING: ${BASE_INCLUDE}/ATen/core/boxing/impl/boxing.h")
# Note: file(COPY ...) expects a directory as DESTINATION, not a file path
file(COPY ${CMAKE_SOURCE_DIR}/boxing.h DESTINATION ${BASE_INCLUDE}/ATen/core/boxing/impl)
```

I think, if you adapt the paths, you could use a similar fix to inject patched pybind11 header files.
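For the pybind11 case, an equivalent shell-level sketch (the exact paths and the source of the replacement header are assumptions; verify them against your setup) would be to locate the include directory of the torch the build actually uses and overwrite the offending header there:

```shell
# Find the include directory bundled with the installed torch.
TORCH_INC=$(python -c "import os, torch; print(os.path.join(os.path.dirname(torch.__file__), 'include'))")
echo "pybind11 headers live under: $TORCH_INC/pybind11"

# Hypothetical patch step: copy cast.h from a locally checked-out, newer
# pybind11 release known to build with your compiler and CUDA version.
# cp /path/to/pybind11/include/pybind11/cast.h "$TORCH_INC/pybind11/cast.h"
```

This only helps if the build actually uses that torch; with build isolation on, pip builds against a fresh copy in /tmp instead.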
Try `export MAX_JOBS=6` before building; it may help.
same error