mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] `mlc_llm serve` throws `CUDA: invalid device ordinal` #2498

Closed: josephrocca closed this issue 2 months ago

josephrocca commented 4 months ago

🐛 Bug

I found this model repo on Hugging Face, kindly shared publicly by @bayley, who also provided the commands for serving it. But when I run those commands, I get an error: CUDA: invalid device ordinal. I tried mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC instead and it works fine.

Could it be that each version of mlc requires a new MLC model conversion? If that is likely the cause, then:

  1. How would I pip install the version of mlc from 25 days ago, when bayley uploaded the model? (A rough sketch of what I mean is below.)
  2. Could the error message be made more informative - e.g. "This model was converted with an older version of MLC. Please install that version of MLC with [...] or re-convert the model."
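
For (1), what I have in mind is pinning specific nightly builds from the wheel index. The version strings below are only placeholders, since I don't know which build matches the upload date:

```sh
# Placeholder sketch: pin older nightly builds instead of the latest ones.
# Browse https://mlc.ai/wheels for builds from around when the model was
# converted, and substitute real version strings for the placeholders.
python3 -m pip install --pre -f https://mlc.ai/wheels \
    "mlc-llm-nightly-cu122==<older-version>" \
    "mlc-ai-nightly-cu122==<older-version>"
```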

To Reproduce

```sh
docker run -it --rm --gpus all nvidia/cuda:12.1.1-devel-ubuntu22.04 bash
apt-get update && apt-get install tmux python3 python3-pip git git-lfs -y
git-lfs install
python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu122 mlc-ai-nightly-cu122
tmux new -s server
cd workspace
git clone https://huggingface.co/bayley/Midnight-Miqu-70B-v1.5-q4f16_1-MLC
mlc_llm compile Midnight-Miqu-70B-v1.5-q4f16_1-MLC/mlc-chat-config.json --device cuda --overrides "tensor_parallel_shards=2" -o /workspace/Midnight-Miqu-70B-v1.5-q4f16_1-cuda.so
mlc_llm serve /workspace/Midnight-Miqu-70B-v1.5-q4f16_1-MLC --model-lib /workspace/Midnight-Miqu-70B-v1.5-q4f16_1-cuda.so --host 0.0.0.0 --mode server --port 8000
```
```
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/cli/worker.py", line 54, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/cli/worker.py", line 49, in main
    worker_func(worker_id, num_workers, reader, writer)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 277, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm.error.InternalError: Traceback (most recent call last):
  6: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (int, int, long, long)>::AssignTypedLambda<void (*)(int, int, long, long)>(void (*)(int, int, long, long), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  5: tvm::runtime::WorkerProcess(int, int, long, long)
  4: tvm::runtime::DiscoWorker::Impl::MainLoop(tvm::runtime::DiscoWorker*)
  3: tvm::runtime::DiscoWorker::Impl::CallPacked(tvm::runtime::DiscoWorker*, long, tvm::runtime::PackedFunc, tvm::runtime::TVMArgs const&)
  2: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::AssignTypedLambda<void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>(void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  1: tvm::runtime::nccl::InitCCLPerWorker(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/runtime/disco/nccl/nccl_context.h", line 65
InternalError: Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: invalid device ordinal
```

Environment

This Docker image:

nvidia/cuda:12.1.1-devel-ubuntu22.04

tqchen commented 4 months ago

Likely it's due to the config being compiled for a higher TP degree (tensor parallelism, i.e. leveraging multiple GPUs) than the number of GPUs you have available.
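
A quick way to confirm is to compare how many GPUs the container sees with the tensor_parallel_shards value the config was generated with, e.g. (paths as in the reproduction steps above):

```sh
# How many GPUs does the container actually see?
nvidia-smi -L

# Which TP degree was the uploaded config generated with?
python3 -c "import json; print(json.load(open('Midnight-Miqu-70B-v1.5-q4f16_1-MLC/mlc-chat-config.json'))['tensor_parallel_shards'])"
```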

bayley commented 4 months ago

Yeah, the model was compiled for 4x GPU. I'll upload 2x GPU versions in a bit, or you can compile them yourself (it's not hard, just a bit inconvenient for the bigger models).
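
Roughly, the self-compile route is to regenerate the config and weights with the shard count you want and then compile. A sketch of what I mean, with flag names from memory (double-check them against mlc_llm convert_weight --help and mlc_llm gen_config --help, and point it at the original fp16 checkpoint):

```sh
# Rough, unverified sketch of a TP=2 conversion; <original-model-dir> stands in
# for the unquantized HF checkpoint this upload was made from.
mlc_llm convert_weight <original-model-dir> \
    --quantization q4f16_1 \
    -o ./Midnight-Miqu-70B-v1.5-q4f16_1-MLC-tp2

# Regenerate mlc-chat-config.json with the TP degree baked in
# (use the same conv template as the uploaded config).
mlc_llm gen_config <original-model-dir> \
    --quantization q4f16_1 \
    --conv-template llama-2 \
    --tensor-parallel-shards 2 \
    -o ./Midnight-Miqu-70B-v1.5-q4f16_1-MLC-tp2

# Compile a model lib; no --overrides needed since TP is in the generated config.
mlc_llm compile ./Midnight-Miqu-70B-v1.5-q4f16_1-MLC-tp2/mlc-chat-config.json \
    --device cuda \
    -o ./Midnight-Miqu-70B-v1.5-q4f16_1-tp2-cuda.so
```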

josephrocca commented 4 months ago

Ah, thank you both.

@bayley A 2x one would be great if it's not too much trouble! Or can I just change the tensor_parallel_shards values in the config file to 2? I just tried that, and also reduced all the 8192 values relating to context length (i.e. everything except hidden_size) down to 4096, but am running into:

TVMError: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 20584.716 MB, which is less than the sum of model weight size (18643.766 MB) and temporary buffer size (2977.048 MB).

which seems a bit strange because there should be ~24GB of memory (i.e. significantly more than 20584.716 MB) available in each of the two 4090 GPUs. So my guess here is that simply changing the tensor_parallel_shards values in the config file is not a valid approach.
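
To double-check the arithmetic, the requested total does exceed the reported budget, so the check itself is consistent; I assume the 20584.716 MB figure is the fraction of the 24 GB card that the serve engine budgets for itself rather than the whole card:

```sh
# Sum the figures from the error message (all in MB) and compare them with the
# reported single-GPU budget.
python3 -c "print(f'{18643.766 + 2977.048:.3f} MB needed vs 20584.716 MB available')"
# -> 21620.814 MB needed vs 20584.716 MB available
```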

bayley commented 4 months ago

Oh hmm, I just realized the quantized weights are independent of the TP rank outside of the config file, so something else is going on here. @tqchen, is --overrides "tensor_parallel_shards=2" the correct flag to set the number of TP shards during compile time?

MasterJH5574 commented 3 months ago

Is --overrides "tensor_parallel_shards=2" the correct flag to set the number of TP shards during compile time?

@bayley Yes that's true. Sorry for the late response.

bayley commented 3 months ago

The chat config in the HF repo was generated with TP=4. Besides the TP override during compile, do any other edits need to be made to the config for TP=2?
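
For anyone trying this in the meantime: the field itself lives in mlc-chat-config.json and can be edited in place as below; whether anything else has to change for TP=2 is exactly the open question here.

```sh
# Edit only tensor_parallel_shards in the uploaded config. Whether additional
# fields must also change for TP=2 is the open question in this thread.
python3 - <<'EOF'
import json

path = "Midnight-Miqu-70B-v1.5-q4f16_1-MLC/mlc-chat-config.json"
with open(path) as f:
    cfg = json.load(f)
cfg["tensor_parallel_shards"] = 2
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
EOF
```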