mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] InternalError: Check failed: (offset + needed_size <= this->buffer.size) is false: storage allocation failure, attempted to allocate 221184 at offset 0 in region that is 163840bytes #1327

IvoryTower800 closed this issue 11 months ago

IvoryTower800 commented 11 months ago

🐛 Bug

I used chinese-alpaca-2-7b-16k as the base model, fine-tuned a LoRA model, and then merged it. The model's outputs are fine when using torch transformers.

However, when I use the following command to convert it,

python3 -m mlc_llm.build --model /hy-tmp/flag/story-7b-16k --target cuda --use-safetensors --max-seq-len 2048

the model can be loaded, but it cannot generate.

from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

cm = ChatModule(model="wrter-7b-hypothesis-q4f16_1")
output = cm.generate(
    prompt="What is the meaning of life?",
    progress_callback=StreamToStdout(callback_interval=2),
)
print(f"Statistics: {cm.stats()}\n")

Traceback (most recent call last):
  File "chats.py", line 5, in <module>
    output = cm.generate(
  File "/usr/local/miniconda3/lib/python3.8/site-packages/mlc_chat/chat_module.py", line 775, in generate
    self._prefill(prompt, generation_config=generation_config)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/mlc_chat/chat_module.py", line 992, in _prefill
    self._prefill_func(
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 277, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/usr/local/miniconda3/lib/python3.8/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 1492, in mlc::llm::LLMChatModule::GetFunction(...)
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 858, in mlc::llm::LLMChat::PrefillStep(...)
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 1088, in mlc::llm::LLMChat::SampleTokenFromLogits(...)
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 1263, in mlc::llm::LLMChat::Softmax(...)
tvm.error.InternalError: Traceback (most recent call last):
  12: mlc::llm::LLMChatModule::GetFunction(...) at /workspace/mlc-llm/cpp/llm_chat.cc:1492
  11: mlc::llm::LLMChat::PrefillStep(...) at /workspace/mlc-llm/cpp/llm_chat.cc:858
  10: mlc::llm::LLMChat::SampleTokenFromLogits(...) at /workspace/mlc-llm/cpp/llm_chat.cc:1088
  9: mlc::llm::LLMChat::Softmax(...) at /workspace/mlc-llm/cpp/llm_chat.cc:1263
  8: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(...)
  7: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(...)>>::Call(...)
  6: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(...)
  5: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
  4: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(...)
  3: _ZN3tvm7runtime13PackedFun
  2: tvm::runtime::TypedPackedFunc<tvm::runtime::NDArray (tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)>::AssignTypedLambda<...>(...)
  1: tvm::runtime::memory::StorageObj::AllocNDArray(long, tvm::runtime::ShapeTuple, DLDataType)
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/runtime/memory/memory_manager.cc", line 108
InternalError: Check failed: (offset + needed_size <= this->buffer.size) is false: storage allocation failure, attempted to allocate 221184 at offset 0 in region that is 163840bytes

I tried different --max-seq-len values; it returns the same error.

What should I do? Thanks.

Environment

MasterJH5574 commented 11 months ago

Hi @IvoryTower800, thanks for reporting. The issue is that the vocabulary size of the Chinese Alpaca model (55296) is larger than the default max vocabulary size (40000) in MLC LLM, so the preallocated logits buffer is too small: the failed 221184-byte allocation is exactly 55296 four-byte logits, while the 163840-byte region only fits 40960. We will fix this issue later on. Meanwhile, as a quick workaround, you can use the command

python3 -m mlc_llm.build --model /hy-tmp/flag/story-7b-16k --target cuda --use-safetensors --max-seq-len 2048 --max-vocab-size 55296

so that the max vocabulary size is explicitly set.
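If you are unsure what value to pass, the vocabulary size can be read directly from the model's config.json. A minimal sketch, using the model path from your command above:

import json

# Read the Hugging Face config shipped with the checkpoint;
# substitute your own model directory.
with open("/hy-tmp/flag/story-7b-16k/config.json") as f:
    config = json.load(f)

# Chinese Alpaca 2 reports 55296 here; pass this value to --max-vocab-size
# whenever it exceeds MLC LLM's default of 40000.
print(config["vocab_size"])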

Please let me know if this can help, thank you!

IvoryTower800 commented 11 months ago

Yes, thank you. After I added --max-vocab-size 55296, the generate function no longer returns any error!

However, the output is just ":" repeated, like below.

(base) root@I166abc628200301838:/hy-tmp/flag# python chats.py
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::^CTraceback (most recent call last):

What should I do to solve this issue? Thanks!

IvoryTower800 commented 11 months ago

@MasterJH5574 Hi, I tried the original Chinese-alpaca-2-7b and Chinese-alpaca-2-7b-16k models.

This issue only exists for the 16k model; the 4k model's outputs are normal.

Are there any parameters I can set to solve this problem?

MasterJH5574 commented 11 months ago

Hi @IvoryTower800, sorry for the delayed response! I checked a bit. It looks like the issue is that the 16k model uses RoPE scaling with a factor of 4.0 (https://huggingface.co/hfl/chinese-alpaca-2-7b-16k/blob/main/config.json#L19-L22) while the 4k model does not (https://huggingface.co/hfl/chinese-alpaca-2-7b/blob/main/config.json).

We do not support RoPE scaling yet. I'll put this on the team's todo list, though it may take a while to add the support.
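For context, the 16k model's config sets rope_scaling to {"type": "linear", "factor": 4.0}, which means positions are divided by the factor before the rotary angles are computed. A minimal NumPy sketch of the idea (illustrative only, not how MLC LLM would implement it):

import numpy as np

def rope_angles(position, head_dim, base=10000.0, scaling_factor=1.0):
    # One inverse frequency per pair of dimensions, as in the RoPE formulation;
    # "linear" scaling simply divides the position index by the factor.
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    return (position / scaling_factor) * inv_freq

# With factor 4.0, position 16000 yields the same angles the base model saw
# at position 4000, stretching a 4k context window over 16k tokens.
assert np.allclose(rope_angles(16000, 128, scaling_factor=4.0),
                   rope_angles(4000, 128))

A build that ignores rope_scaling feeds the model rotary angles it never saw during fine-tuning, which is consistent with the degenerate ":" output above.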

IvoryTower800 commented 11 months ago

@MasterJH5574 Thank you for your reply. This project is amazing work. RoPE scaling is an important feature for many users. Personally speaking, if RoPE scaling were supported, mlc-llm would be infinitely close to perfect. I really hope it can be done soon. Thanks.

MasterJH5574 commented 11 months ago

Thank you @IvoryTower800, I totally agree with you that RoPE scaling is a very important feature. We have created a tracking issue for the progress: https://github.com/mlc-ai/mlc-llm/issues/1344.

MasterJH5574 commented 11 months ago

Going to close this issue as the original problem is addressed. We will follow up with improvements, including better error reporting regarding the maximum vocab size.

MrJungle1 commented 11 months ago

@MasterJH5574 Hello, I would like to ask about something. I also use chinese-alpaca. When I use mlc_llm.build to convert it,

python -m mlc_llm.build --model Desktop/mlc-llm-main/dist/models/x3 --target iphone --quantization q3f16_1 --max-seq-len 768

passing --quantization q0f16 cannot run on the iPhone because it requires 6.39 GB of memory for inference; q8f16_1 is normal and requires only 1.72 GB; q4f16_1 is abnormal and requires 4.38 GB; q3f16_1 is normal and requires only 893.5 MB.
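For a rough sanity check of those numbers: weight memory should scale roughly with bits per weight, so q4f16_1 would be expected to land between q8f16_1 and q3f16_1. A back-of-envelope sketch, assuming a 7B-parameter model and ignoring the KV cache and runtime overhead (actual figures depend heavily on the real parameter count and quantization details):

def weight_gib(n_params_billion: float, bits_per_weight: float) -> float:
    # Weights only: parameter count times bits per weight, converted to GiB.
    return n_params_billion * 1e9 * bits_per_weight / 8 / 2**30

for mode, bits in [("q0f16", 16), ("q8f16_1", 8), ("q4f16_1", 4), ("q3f16_1", 3)]:
    print(f"{mode}: ~{weight_gib(7.0, bits):.2f} GiB of weights")

By this ordering, a q4f16_1 footprint larger than q8f16_1's looks surprising, which matches the abnormal behavior described above.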