mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] CUDA: out of memory on dual gpu #2492

Closed: pw-k closed this issue 4 months ago

pw-k commented 4 months ago

🐛 Bug

I have dual RTX 3090s. I compiled the model with: mlc_llm compile Llama-3-70B-Instruct-q4f16_1-MLC/mlc-chat-config.json --device cuda --overrides "tensor_parallel_shards=2" -o Llama-3-70B-Instruct-q4f16_1-cuda.so

Then I ran: mlc_llm serve Llama-3-70B-Instruct-q4f16_1-MLC --model-lib Llama-3-70B-Instruct-q4f16_1-cuda.so --host 0.0.0.0

It ends with the error: ValueError: Error when loading parameters from params_shard_299.bin: [08:29:09] /workspace/tvm/src/runtime/cuda/cuda_device_api.cc:145: InternalError: Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory

Watching nvidia-smi shows that memory fills up on the first card while the second stays unused. Once the first card is full, it dies with the error above. I can run smaller models, but I can't get it to use both GPUs.

What am I missing? How do I use both GPUs?

To Reproduce

Steps to reproduce the behavior:

  1. git clone https://huggingface.co/mlc-ai/Llama-3-70B-Instruct-q4f16_1-MLC
  2. mlc_llm compile Llama-3-70B-Instruct-q4f16_1-MLC/mlc-chat-config.json --device cuda --overrides "tensor_parallel_shards=2" -o Llama-3-70B-Instruct-q4f16_1-cuda.so
  3. mlc_llm serve Llama-3-70B-Instruct-q4f16_1-MLC --model-lib Llama-3-70B-Instruct-q4f16_1-cuda.so --host 0.0.0.0

[2024-06-03 08:28:24] INFO auto_device.py:79: Found device: cuda:0
[2024-06-03 08:28:24] INFO auto_device.py:79: Found device: cuda:1
[2024-06-03 08:28:25] INFO auto_device.py:88: Not found device: rocm:0
[2024-06-03 08:28:26] INFO auto_device.py:88: Not found device: metal:0
[2024-06-03 08:28:27] INFO auto_device.py:79: Found device: vulkan:0
[2024-06-03 08:28:27] INFO auto_device.py:79: Found device: vulkan:1
[2024-06-03 08:28:28] INFO auto_device.py:88: Not found device: opencl:0
[2024-06-03 08:28:28] INFO auto_device.py:35: Using device: cuda:0
[2024-06-03 08:28:28] INFO engine_base.py:141: Using library model: Llama-3-70B-Instruct-q4f16_1-cuda.so
[08:28:29] /workspace/mlc-llm/cpp/serve/config.cc:646: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 2000, prefill chunk size will be set to 2000.
[08:28:29] /workspace/mlc-llm/cpp/serve/config.cc:646: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 2038, prefill chunk size will be set to 2038.
[08:28:29] /workspace/mlc-llm/cpp/serve/config.cc:646: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 1037, prefill chunk size will be set to 2048.
[08:28:29] /workspace/mlc-llm/cpp/serve/config.cc:726: The actual engine mode is "interactive". So max batch size is 1, max KV cache token capacity is 2038, prefill chunk size is 2038.
[08:28:29] /workspace/mlc-llm/cpp/serve/config.cc:731: Estimated total single GPU memory usage: 20614.709 MB (Parameters: 19489.766 MB. KVCache: 402.956 MB. Temporary buffer: 721.988 MB). The actual usage might be slightly larger than the estimated number.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/test/miniconda3/envs/mlc2/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/home/test/miniconda3/envs/mlc2/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/test/miniconda3/envs/mlc2/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "/workspace/mlc-llm/cpp/serve/threaded_engine.cc", line 156, in mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
  File "/workspace/mlc-llm/cpp/serve/threaded_engine.cc", line 269, in mlc::llm::serve::ThreadedEngineImpl::EngineReloadImpl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  File "/workspace/mlc-llm/cpp/serve/engine.cc", line 800, in mlc::llm::serve::Engine::Create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DLDevice, tvm::runtime::TypedPackedFunc<void (tvm::runtime::Array<mlc::llm::serve::RequestStreamOutput, void>)>, tvm::runtime::Optional<mlc::llm::serve::EventTraceRecorder>)
  File "/workspace/mlc-llm/cpp/serve/engine.cc", line 341, in mlc::llm::serve::EngineImpl::Create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DLDevice, tvm::runtime::TypedPackedFunc<void (tvm::runtime::Array<mlc::llm::serve::RequestStreamOutput, void>)>, tvm::runtime::Optional<mlc::llm::serve::EventTraceRecorder>)
  File "/workspace/mlc-llm/cpp/serve/model.cc", line 666, in mlc::llm::serve::ModelImpl::LoadParams()
  File "/workspace/mlc-llm/cpp/serve/function_table.cc", line 176, in mlc::llm::serve::FunctionTable::LoadParams(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DLDevice)
ValueError: Traceback (most recent call last):
  8: mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
        at /workspace/mlc-llm/cpp/serve/threaded_engine.cc:156
  7: mlc::llm::serve::ThreadedEngineImpl::EngineReloadImpl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        at /workspace/mlc-llm/cpp/serve/threaded_engine.cc:269
  6: mlc::llm::serve::Engine::Create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DLDevice, tvm::runtime::TypedPackedFunc<void (tvm::runtime::Array<mlc::llm::serve::RequestStreamOutput, void>)>, tvm::runtime::Optional<mlc::llm::serve::EventTraceRecorder>)
        at /workspace/mlc-llm/cpp/serve/engine.cc:800
  5: mlc::llm::serve::EngineImpl::Create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DLDevice, tvm::runtime::TypedPackedFunc<void (tvm::runtime::Array<mlc::llm::serve::RequestStreamOutput, void>)>, tvm::runtime::Optional<mlc::llm::serve::EventTraceRecorder>)
        at /workspace/mlc-llm/cpp/serve/engine.cc:341
  4: mlc::llm::serve::ModelImpl::LoadParams()
        at /workspace/mlc-llm/cpp/serve/model.cc:666
  3: mlc::llm::serve::FunctionTable::LoadParams(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DLDevice)
        at /workspace/mlc-llm/cpp/serve/function_table.cc:176
  2: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)>::AssignTypedLambda<void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)>(void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  1: tvm::runtime::relax_vm::NDArrayCache::Load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)
  0: _ZN3tvm7runtime6deta
  13: mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
        at /workspace/mlc-llm/cpp/serve/threaded_engine.cc:156
  12: mlc::llm::serve::ThreadedEngineImpl::EngineReloadImpl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        at /workspace/mlc-llm/cpp/serve/threaded_engine.cc:269
  11: mlc::llm::serve::Engine::Create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DLDevice, tvm::runtime::TypedPackedFunc<void (tvm::runtime::Array<mlc::llm::serve::RequestStreamOutput, void>)>, tvm::runtime::Optional<mlc::llm::serve::EventTraceRecorder>)
        at /workspace/mlc-llm/cpp/serve/engine.cc:800
  10: mlc::llm::serve::EngineImpl::Create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DLDevice, tvm::runtime::TypedPackedFunc<void (tvm::runtime::Array<mlc::llm::serve::RequestStreamOutput, void>)>, tvm::runtime::Optional<mlc::llm::serve::EventTraceRecorder>)
        at /workspace/mlc-llm/cpp/serve/engine.cc:341
  9: mlc::llm::serve::ModelImpl::LoadParams()
        at /workspace/mlc-llm/cpp/serve/model.cc:666
  8: mlc::llm::serve::FunctionTable::LoadParams(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DLDevice)
        at /workspace/mlc-llm/cpp/serve/function_table.cc:176
  7: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)>::AssignTypedLambda<void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)>(void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  6: tvm::runtime::relax_vm::NDArrayCache::Load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)
  5: tvm::runtime::relax_vm::NDArrayCacheMetadata::FileRecord::Load(DLDevice, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, tvm::runtime::Optional<tvm::runtime::NDArray>*) const
  4: tvm::runtime::relax_vm::NDArrayCacheMetadata::FileRecord::ParamRecord::Load(DLDevice, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, tvm::runtime::Optional<tvm::runtime::NDArray>*) const
  3: tvm::runtime::NDArray::Empty(tvm::runtime::ShapeTuple, DLDataType, DLDevice, tvm::runtime::Optional<tvm::runtime::String>)
  2: tvm::runtime::DeviceAPI::AllocDataSpace(DLDevice, int, long const*, DLDataType, tvm::runtime::Optional<tvm::runtime::String>)
  1: tvm::runtime::CUDADeviceAPI::AllocDataSpace(DLDevice, unsigned long, unsigned long, DLDataType)
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/runtime/relax_vm/ndarray_cache_support.cc", line 255
ValueError: Error when loading parameters from params_shard_299.bin: [08:29:09] /workspace/tvm/src/runtime/cuda/cuda_device_api.cc:145: InternalError: Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory

Environment

vinx13 commented 4 months ago

Did you specify --tensor-parallel-shards when running mlc_llm gen_config? That is where the tensor parallel setting used by the server comes from.
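
For later readers, a minimal sketch of the full workflow this suggests. The original-weights path and the --conv-template value in the gen_config line below are assumptions for illustration (adjust them to your setup); the compile and serve commands are the ones from the report, rerun after tensor_parallel_shards is written into the config:

  # Regenerate mlc-chat-config.json with tensor parallelism set to 2 (weights path and conv template are hypothetical)
  mlc_llm gen_config ./Meta-Llama-3-70B-Instruct --quantization q4f16_1 --conv-template llama-3 --tensor-parallel-shards 2 -o Llama-3-70B-Instruct-q4f16_1-MLC
  # Recompile the model library against the updated config, then serve as before
  mlc_llm compile Llama-3-70B-Instruct-q4f16_1-MLC/mlc-chat-config.json --device cuda -o Llama-3-70B-Instruct-q4f16_1-cuda.so
  mlc_llm serve Llama-3-70B-Instruct-q4f16_1-MLC --model-lib Llama-3-70B-Instruct-q4f16_1-cuda.so --host 0.0.0.0

Since the prebuilt Llama-3-70B-Instruct-q4f16_1-MLC repo already ships an mlc-chat-config.json, editing its tensor_parallel_shards field to 2 and recompiling should have the same effect.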

pw-k commented 4 months ago

Thanks! That was my mistake.