mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] ROCm with AMD APU - Ryzen 7940HS / Radeon 780M #787

Closed: lhl closed this issue 1 year ago

lhl commented 1 year ago

🐛 Bug

After building a custom ROCm TVM, compiling a model, and building a custom mlc-chat-cli, I end up with an assert error when trying to run inference:

./build/mlc_chat_cli --local-id meta-llama_Llama-2-7b-hf-q4f16_1 --evaluate --eval-gen-len 1920 --device rocm      (mlc) 
Use MLC config: "/home/lhl/mlc/mlc-llm/dist/meta-llama_Llama-2-7b-hf-q4f16_1/params/mlc-chat-config.json"
Use model weights: "/home/lhl/mlc/mlc-llm/dist/meta-llama_Llama-2-7b-hf-q4f16_1/params/ndarray-cache.json"
Use model library: "/home/lhl/mlc/mlc-llm/dist/meta-llama_Llama-2-7b-hf-q4f16_1/meta-llama_Llama-2-7b-hf-q4f16_1-rocm.so"
[18:43:39] /home/lhl/mlc/mlc-llm/3rdparty/tvm/src/runtime/library_module.cc:87: Assert fail: T.Cast("int32", split_rotary_cos_h_shape[0]) == -1, Argument split_rotary.cos_h.shape[0] has an unsatisfied constraint: -1 == T.Cast("int32", split_rotary_cos_h_shape[0])
Stack trace:
  [bt] (0) /home/lhl/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::Backtrace[abi:cxx11]()+0x27) [0x7f40e7aef6d7]
  [bt] (1) ./build/mlc_chat_cli(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x37) [0x564a7d8c75a7]
  [bt] (2) /home/lhl/mlc/mlc-llm/build/tvm/libtvm_runtime.so(+0xeb9c4) [0x7f40e7aeb9c4]
  [bt] (3) /home/lhl/mlc/mlc-llm/build/tvm/libtvm_runtime.so(+0xebb90) [0x7f40e7aebb90]
  [bt] (4) /home/lhl/mlc/mlc-llm/build/tvm/libtvm_runtime.so(+0x13c9f2) [0x7f40e7b3c9f2]
  [bt] (5) /home/lhl/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)+0x930) [0x7f40e7b78f40]
  [bt] (6) /home/lhl/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()+0x2e7) [0x7f40e7b758a7]
  [bt] (7) /home/lhl/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)+0x25d) [0x7f40e7b75dad]
  [bt] (8) /home/lhl/mlc/mlc-llm/build/tvm/libtvm_runtime.so(+0x1761f5) [0x7f40e7b761f5]

This might be a TVM issue? I am using ROCm 5.6 with HSA_OVERRIDE_GFX_VERSION=11.0.0 (the Radeon 780M is gfx1103 / gfx1103_r1), so it could also be a ROCm issue, although I was able to get ExLlama running...
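
For reference (my addition, not part of the original report), a quick sanity check of what ISA the ROCm runtime reports for the iGPU and whether the override is visible, assuming a standard ROCm install under /opt/rocm:

/opt/rocm/bin/rocminfo | grep -i gfx      # the GPU agent should list gfx1103
echo $HSA_OVERRIDE_GFX_VERSION            # should print 11.0.0 once the override is set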

Some related bugs:

To Reproduce

I am using ROCm 5.6 on Arch (installed via rocm-hip-sdk 5.6.0-1).

I have a clean conda env:

conda create -n mlc
conda activate mlc
mamba install pip
conda env config vars set HSA_OVERRIDE_GFX_VERSION=11.0.0
conda activate mlc
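
One way to double-check that the override actually sticks to the env (my own check, not in the original steps):

conda env config vars list    # run inside the activated env; should show HSA_OVERRIDE_GFX_VERSION=11.0.0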

For TVM with ROCm, it'd be nice if we could use the prebuilt wheels:

pip install --pre --force-reinstall mlc-ai-nightly-rocm mlc-chat-nightly-rocm -f https://mlc.ai/wheels

But, despite having installed LLVM 16:

mamba install llvmdev
llvm-config --version
16.0.6

I still get an error; is it using my system LLVM?

  1: tvm::codegen::LLVMInstance::ParseBuffer(llvm::MemoryBuffer const&) const
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/target/llvm/llvm_instance.cc", line 139
TVMError: /opt/rocm/amdgcn/bitcode/ocml.bc: error: Unknown attribute kind (86) (Producer: 'LLVM16.0.0git' Reader: 'LLVM 15.0.7')
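
A hedged sanity check (mine, not from the report) of which llvm-config actually resolves inside the conda env; note the error above seems to come from whatever LLVM the prebuilt TVM wheel was linked against (the "Reader: LLVM 15.0.7"), so a newer conda LLVM may not help:

which llvm-config        # should point into $CONDA_PREFIX/bin if the conda LLVM is first on PATH
llvm-config --version
llvm-config --prefix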

I'm not an LLVM expert, so maybe there's something I forgot to install. In any case, my workaround was just to build TVM from source:

git clone --recursive git@github.com:mlc-ai/relax.git tvm-unity && cd tvm-unity
rm -rf build && mkdir build && cd build
cp ../cmake/config.cmake .
echo "set(CMAKE_BUILD_TYPE RelWithDebInfo)" >> config.cmake
echo "set(USE_LLVM \"llvm-config --ignore-libllvm --link-static\")" >> config.cmake
echo "set(HIDE_PRIVATE_SYMBOLS ON)" >> config.cmake
echo "set(USE_ROCM ON)" >> config.cmake
cmake .. && cmake --build . --parallel $(nproc)

Once that was done, I ran into some issues with libstdc++.so.6. This is because the system and conda ship different versions of the library, and it seems to happen even if you install the full gxx or cxx-compiler packages. I ended up fixing this with a bit of a hack:

cd $CONDA_PREFIX/lib
rm libstdc++.so.6
ln -s /usr/lib/libstdc++.so.6

With this, all the TVM tests work.
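
As an extra check (not in the original steps, and assuming the tvm-unity Python package is on PYTHONPATH), the locally built runtime should see the ROCm device:

python -c "import tvm; print(tvm.rocm(0).exist)"    # expect True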

After that, I compile a model per the directions. The only issue is that mlc_llm.build requires pytest to be installed, or it dies:

pip install pytest
python -m mlc_llm.build --model /data/ai/models/llm/meta-llama_Llama-2-7b-hf --target rocm --quantization q4f16_1
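
After the build, the artifacts that the chat CLI loads (see the log at the top) should be under dist/; this listing is my own hedged check, not part of the original instructions:

ls dist/meta-llama_Llama-2-7b-hf-q4f16_1/          # expect meta-llama_Llama-2-7b-hf-q4f16_1-rocm.so
ls dist/meta-llama_Llama-2-7b-hf-q4f16_1/params/   # expect mlc-chat-config.json and ndarray-cache.json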

Finally, the chat CLI. I would like to use the prebuilt package:

mamba install -c mlc-ai -c conda-forge mlc-chat-cli-nightly

but it complains about not having a {hip} option, so I build my own. The instructions are fine assuming you didn't mess anything up (I had to blow away the conda env a couple of times, since weird stuff seems to hang around if you break something).

And that gets us to the error.

Expected behavior

It'd be nice if it inferenced! :)

Environment

Additional context

Beomi commented 1 year ago

Similar issue with CUDA (11.8) on an RTX 3090.

Conversion was successful, but mlc_chat does not work.

Here's the full log:

(mlc-llm) w3090 :: ~/coding/mlc-llm » python3 -m mlc_chat.rest --model dist/Llama-2-ko-7b-Chat-q4f16_1/params --lib-path dist/Llama-2-ko-7b-Chat-q4f16_1/Llama-2-ko-7b-Chat-q4f16_1-cuda.so --host 0.0.0.0 
INFO:     Started server process [1238047]
INFO:     Waiting for application startup.
System automatically detected device: cuda
Using model folder: /ssd4t/coding-june/mlc-llm/dist/Llama-2-ko-7b-Chat-q4f16_1/params
Using mlc chat config: /ssd4t/coding-june/mlc-llm/dist/Llama-2-ko-7b-Chat-q4f16_1/params/mlc-chat-config.json
Using library model: dist/Llama-2-ko-7b-Chat-q4f16_1/Llama-2-ko-7b-Chat-q4f16_1-cuda.so
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/fastapi/applications.py", line 289, in __call__
    await super().__call__(scope, receive, send)
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/fastapi/routing.py", line 190, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/mlc_chat/rest.py", line 186, in request_completion
    msg = session["chat_mod"].generate(prompt=request.messages[0].content)
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/mlc_chat/chat_module.py", line 657, in generate
    self._decode()
  File "/home/beomi/anaconda3/envs/mlc-llm/lib/python3.10/site-packages/mlc_chat/chat_module.py", line 859, in _decode
    self._decode_func()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 331, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 262, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 251, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 181, in tvm._ffi._cy3.core.CHECK_CALL
tvm._ffi.base.TVMError: Traceback (most recent call last):
  10: TVMFuncCall
  9: mlc::llm::LLMChat::DecodeStep()
        at /workspace/mlc-llm/cpp/llm_chat.cc:638
  8: mlc::llm::LLMChat::ForwardTokens(std::vector<int, std::allocator<int> >, long)
        at /workspace/mlc-llm/cpp/llm_chat.cc:843
  7: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  6: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  5: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)
  4: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
  3: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)
  2: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::__mk_TVM11::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::relax_vm::__mk_TVM11, tvm::runtime::TVMRetValue)
  1: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*, void*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/runtime/library_module.cc", line 87
TVMError: Assert fail: T.Cast("int32", split_rotary_cos_h_shape[0]) == -1, Argument split_rotary.cos_h.shape[0] has an unsatisfied constraint: -1 == T.Cast("int32", split_rotary_cos_h_shape[0])

lsr0 commented 1 year ago

I experienced the same problem with vicuna-13b-v1.5-q4f16_1, vicuna-13b-v1.3-q4f16_1, vicuna-13b-v1.1-q4f16_1, and vicuna-7b-v1.5-q4f16_1 but not vicuna-7b-v1.3-q4f16_1. Just to clarify: also using ROCm, but a 7900XT.

peacepenguin commented 1 year ago

I compiled tvm-unity with ROCm support for gfx1030 (6800 XT GPU).

I get the exact same error when trying to use mlc-llm chat with the Llama 2 models.

Hzfengsy commented 1 year ago

Thanks for reporting; this should be fixed in https://github.com/mlc-ai/mlc-llm/pull/797

tqchen commented 1 year ago

Thanks for reporting, glad that it got resolved.

bod2000 commented 1 year ago

@lhl did you try to run this fix on Ryzen 7940HS / Radeon 780M? Does it work?

lhl commented 1 year ago

@lhl did you try to run this fix on Ryzen 7940HS / Radeon 780M? Does it work?

Yes, I was able to get it working. From my testing back in September, the ROCm version was 5X faster than CPU and ~4X faster than CLBlast for prefill, but still just bandwidth limited at batch=1. Results of benchmark testing here: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=1041125589

The biggest limitation of the ROCm version is that it can only use GART VRAM, although there may be workarounds with a PyTorch GTT allocator that I haven't tried. Some notes I made are here: https://llm-tracker.info/books/howto-guides/page/amd-gpus#bkmrk-amd-apu
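
As a hedged way to see how much carve-out ("GART") VRAM versus GTT memory the APU actually exposes (my own suggestion, not something tested in this thread):

rocm-smi --showmeminfo vram    # the carve-out VRAM visible to ROCm
rocm-smi --showmeminfo gtt     # GTT memory, which the stock ROCm path does not use here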

SuperGoodGame commented 6 months ago

@lhl Can you tell me how to make PyTorch recognize the 780M correctly? My ROCm install cannot find my 780M.