Similar issue with CUDA 11.8 on an RTX 3090.
The conversion was successful, but mlc_chat does not work.
Here's the full log:
(mlc-llm) w3090 :: ~/coding/mlc-llm » python3 -m mlc_chat.rest --model dist/Llama-2-ko-7b-Chat-q4f16_1/params --lib-path dist/Llama-2-ko-7b-Chat-q4f16_1/Llama-2-ko-7b-Chat-q4f16_1-cuda.so --host 0.0.0.0
INFO: Started server process [1238047]
INFO: Waiting for application startup.
System automatically detected device: cuda
Using model folder: /ssd4t/coding-june/mlc-llm/dist/Llama-2-ko-7b-Chat-q4f16_1/params
Using mlc chat config: /ssd4t/coding-june/mlc-llm/dist/Llama-2-ko-7b-Chat-q4f16_1/params/mlc-chat-config.json
Using library model: dist/Llama-2-ko-7b-Chat-q4f16_1/Llama-2-ko-7b-Chat-q4f16_1-cuda.so
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/fastapi/applications.py", line 289, in __call__
await super().__call__(scope, receive, send)
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
await self.app(scope, receive, send)
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
raise e
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
await self.app(scope, receive, send)
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
response = await func(request)
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/fastapi/routing.py", line 273, in app
raw_response = await run_endpoint_function(
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/fastapi/routing.py", line 190, in run_endpoint_function
return await dependant.call(**values)
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/mlc_chat/rest.py", line 186, in request_completion
msg = session["chat_mod"].generate(prompt=request.messages[0].content)
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/mlc_chat/chat_module.py", line 657, in generate
self._decode()
File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/mlc_chat/chat_module.py", line 859, in _decode
self._decode_func()
File "tvm/_ffi/_cython/./packed_func.pxi", line 331, in tvm._ffi._cy3.core.PackedFuncBase.__call__
File "tvm/_ffi/_cython/./packed_func.pxi", line 262, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 251, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 181, in tvm._ffi._cy3.core.CHECK_CALL
tvm._ffi.base.TVMError: Traceback (most recent call last):
10: TVMFuncCall
9: mlc::llm::LLMChat::DecodeStep()
at /workspace/mlc-llm/cpp/llm_chat.cc:638
8: mlc::llm::LLMChat::ForwardTokens(std::vector<int, std::allocator<int> >, long)
at /workspace/mlc-llm/cpp/llm_chat.cc:843
7: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
6: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
5: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)
4: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
3: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)
2: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::__mk_TVM11::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::relax_vm::__mk_TVM11, tvm::runtime::TVMRetValue)
1: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*, void*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
0: _ZN3tvm7runtime6deta
File "/workspace/tvm/src/runtime/library_module.cc", line 87
TVMError: Assert fail: T.Cast("int32", split_rotary_cos_h_shape[0]) == -1, Argument split_rotary.cos_h.shape[0] has an unsatisfied constraint: -1 == T.Cast("int32", split_rotary_cos_h_shape[0])
I experienced the same problem with vicuna-13b-v1.5-q4f16_1, vicuna-13b-v1.3-q4f16_1, vicuna-13b-v1.1-q4f16_1, and vicuna-7b-v1.5-q4f16_1, but not vicuna-7b-v1.3-q4f16_1.
Just to clarify: I'm also using ROCm, but with a 7900 XT.
I compiled tvm-unity with ROCm support for gfx1030 (6800 XT GPU).
I get the exact same error when trying to use mlc-llm chat with the Llama 2 models.
Thanks for reporting, this should be fixed in https://github.com/mlc-ai/mlc-llm/pull/797
Thanks for reporting, glad that it got resolved.
@lhl did you try to run this fix on Ryzen 7940HS / Radeon 780M? Does it work?
Yes, I was able to get it working. From my testing back in September, the ROCm version was 5X faster than CPU and ~4X faster than CLBlast for prefill, but still just bandwidth limited at batch=1. Results of benchmark testing here: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=1041125589
The biggest limitation of the ROCm version is that it can only use GART VRAM, although there may be workarounds with a PyTorch GTT allocator; I haven't tried. Some notes I made here: https://llm-tracker.info/books/howto-guides/page/amd-gpus#bkmrk-amd-apu
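For anyone else poking at an APU, a quick way to see how big each memory pool actually is (assuming rocm-smi is installed):

```bash
rocm-smi --showmeminfo vram   # the fixed GART/UMA carve-out that ROCm uses
rocm-smi --showmeminfo gtt    # the larger kernel-managed pool it normally can't touch
```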
@lhl Can you tell me how to get PyTorch to recognize the 780M correctly? My ROCm install cannot find my 780M.
🐛 Bug
After building a custom ROCm TVM, compiling a model, and building a custom mlc-chat-cli, I end up with an assert error when trying to run inference:
This might be a TVM issue? I am using ROCm 5.6 with HSA_OVERRIDE_GFX_VERSION=11.0.0 (the Radeon 780M is gfx1103 / gfx1103_r1), so it could also be a ROCm issue, although I was able to get ExLlama running...
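For reference, a minimal sketch of how the override gets applied before launching anything:

```bash
# gfx1103 isn't an officially supported ROCm target, so spoof a supported
# RDNA3 target before anything touches the HSA runtime
export HSA_OVERRIDE_GFX_VERSION=11.0.0
rocminfo | grep -i gfx    # should now report the overridden target
```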
Some related bugs:
#783 - I ran into some issues w/ a prebuilt Vulkan model
#782 - I ran into this issue using ROCm's LLVM and was able to fix it by using conda's `llvmdev` to get LLVM 16 (Arch still uses LLVM 15)

To Reproduce
I am using ROCm 5.6 on Arch (installed via `rocm-hip-sdk 5.6.0-1`). I have a clean conda env:
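Roughly, something like (the Python version here is an assumption):

```bash
conda create -n mlc-llm python=3.10
conda activate mlc-llm
```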
For TVM w/ ROCm, it'd be nice if we could use the prebuilt:
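Something along these lines (the index URL is from the MLC docs; the exact ROCm wheel name is an assumption):

```bash
pip install --pre -f https://mlc.ai/wheels mlc-ai-nightly-rocm56   # assumed package name
```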
But despite having installed LLVM 16 (conda's `llvmdev`), I still get an error that looks like it's coming from my system LLVM. I'm not an LLVM expert, so maybe there's something I forgot to install. In any case, my workaround was just to build from source:
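A rough outline of that build, using the standard TVM Unity steps and flags rather than my exact shell history:

```bash
git clone --recursive https://github.com/mlc-ai/relax.git tvm-unity
cd tvm-unity && mkdir build && cd build
cp ../cmake/config.cmake .
echo 'set(USE_ROCM ON)' >> config.cmake
echo 'set(USE_LLVM ON)' >> config.cmake
cmake .. && make -j"$(nproc)"
cd ../python && pip install -e .
```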
Once that was done, I ended up w/ some issues w/ `libstdc++.so.6`. This is because the system and conda have different libs, and it seems to happen even if you install the full `gxx` or `cxx-compiler` packages. I ended up fixing this with a bit of a hack; with that in place, all the TVM tests work.
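A typical version of that hack is pointing the conda env at the (newer) system library; the paths below are Arch-style and illustrative:

```bash
# back up conda's copy, then link in the system libstdc++ so the newer
# GLIBCXX symbols resolve when TVM's shared library is loaded
mv "$CONDA_PREFIX/lib/libstdc++.so.6" "$CONDA_PREFIX/lib/libstdc++.so.6.bak"
ln -s /usr/lib/libstdc++.so.6 "$CONDA_PREFIX/lib/libstdc++.so.6"
```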
After that, I compile a model per the directions. The only issue is that mlc_llm.build requires a `pip install pytest` or it dies.

Finally, for the chat, I would like to use the prebuilt:
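i.e. the conda package from the docs (package name assumed):

```bash
conda install -c mlc-ai -c conda-forge mlc-chat-cli-nightly   # assumed package name
```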
But it complains about not having a `hip` option, so I build my own (sketched below). The instructions are fine assuming you didn't mess anything up (I had to blow away the conda venv a couple of times since weird stuff seems to hang around if you screw anything up).
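The build itself is the standard CMake flow from the docs, paraphrased rather than verbatim:

```bash
git clone --recursive https://github.com/mlc-ai/mlc-llm.git
cd mlc-llm && mkdir build && cd build
cmake ..            # plus whatever options your setup needs (e.g. pointing at the local TVM)
make -j"$(nproc)"   # produces mlc_chat_cli in the build directory
```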
And that gets us to the error.
Expected behavior
It'd be nice if it inferenced! :)
Environment
- How you installed MLC-LLM (`conda`, source): source
- How you installed TVM-Unity (`pip`, source): source
- TVM Unity Hash Tag (`python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"`, applicable if you compile models):

Additional context
Should `USE_ROCBLAS` or `USE_MIOPEN` be used for the TVM build, or is it not applicable for MLC LLM?
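For what it's worth, both are ordinary switches in TVM's config.cmake, so trying them is just a rebuild away (whether MLC LLM's generated kernels would ever call into them is the open question):

```bash
# from the tvm-unity/build directory used above:
echo 'set(USE_ROCBLAS ON)' >> config.cmake   # off by default
echo 'set(USE_MIOPEN ON)'  >> config.cmake   # off by default
cmake .. && make -j"$(nproc)"
```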