mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] InternalError: Check failed: (offset + needed_size <= this->buffer.size) is false: storage allocation failure, attempted to allocate 221184 at offset 0 in region that is 163840bytes #1327

IvoryTower800 closed this issue 11 months ago

IvoryTower800 commented 11 months ago

🐛 Bug

I used chinese-alpaca-2-7b-16k as the base model, fine-tuned a LoRA model, and then merged it. The model's outputs are fine when using torch transformers.

However, when I use the following command to convert it,

python3 -m mlc_llm.build --model /hy-tmp/flag/story-7b-16k --target cuda --use-safetensors --max-seq-len 2048

the model can be loaded, but it cannot generate.

from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

cm = ChatModule(model="wrter-7b-hypothesis-q4f16_1")
output = cm.generate(
    prompt="What is the meaning of life?",
    progress_callback=StreamToStdout(callback_interval=2),
)
print(f"Statistics: {cm.stats()}\n")

Traceback (most recent call last):
  File "chats.py", line 5, in <module>
    output = cm.generate(
  File "/usr/local/miniconda3/lib/python3.8/site-packages/mlc_chat/chat_module.py", line 775, in generate
    self._prefill(prompt, generation_config=generation_config)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/mlc_chat/chat_module.py", line 992, in _prefill
    self._prefill_func(
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 277, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/usr/local/miniconda3/lib/python3.8/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 1492, in mlc::llm::LLMChatModule::GetFunction(...)
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 858, in mlc::llm::LLMChat::PrefillStep(...)
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 1088, in mlc::llm::LLMChat::SampleTokenFromLogits(...)
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 1263, in mlc::llm::LLMChat::Softmax(...)
tvm.error.InternalError: Traceback (most recent call last):
  12: mlc::llm::LLMChatModule::GetFunction(...) at /workspace/mlc-llm/cpp/llm_chat.cc:1492
  11: mlc::llm::LLMChat::PrefillStep(...) at /workspace/mlc-llm/cpp/llm_chat.cc:858
  10: mlc::llm::LLMChat::SampleTokenFromLogits(...) at /workspace/mlc-llm/cpp/llm_chat.cc:1088
  9: mlc::llm::LLMChat::Softmax(...) at /workspace/mlc-llm/cpp/llm_chat.cc:1263
  8: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(...)
  7: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(...)>>::Call(...)
  6: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(...)
  5: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
  4: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(...)
  3: _ZN3tvm7runtime13PackedFun
  2: tvm::runtime::TypedPackedFunc<tvm::runtime::NDArray (tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)>::AssignTypedLambda<...>(...)
  1: tvm::runtime::memory::StorageObj::AllocNDArray(long, tvm::runtime::ShapeTuple, DLDataType)
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/runtime/memory/memory_manager.cc", line 108
InternalError: Check failed: (offset + needed_size <= this->buffer.size) is false: storage allocation failure, attempted to allocate 221184 at offset 0 in region that is 163840bytes

I tried different --max-seq-len values; it returns the same error.

What should I do? Thanks.

Environment

MasterJH5574 commented 11 months ago

Hi @IvoryTower800, thanks for reporting. The issue is that the vocabulary size of the Chinese Alpaca model (55296) is larger than the default max vocabulary size (40000) in MLC LLM, so the preallocated logits buffer is too small: the failed 221184-byte allocation is exactly 55296 four-byte logits, while the 163840-byte region only fits 40960. We will fix this issue later on. Meanwhile, as a quick workaround, you can use the command

python3 -m mlc_llm.build --model /hy-tmp/flag/story-7b-16k --target cuda --use-safetensors --max-seq-len 2048 --max-vocab-size 55296

so that the max vocabulary size is explicitly set.
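If you are unsure what value to pass, the vocabulary size can be read directly from the model's config.json. A minimal sketch, using the model path from your command above:

import json

# Read the Hugging Face config shipped with the checkpoint;
# substitute your own model directory.
with open("/hy-tmp/flag/story-7b-16k/config.json") as f:
    config = json.load(f)

# Chinese Alpaca 2 reports 55296 here; pass this value to --max-vocab-size
# whenever it exceeds MLC LLM's default of 40000.
print(config["vocab_size"])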

Please let me know if this can help, thank you!

IvoryTower800 commented 11 months ago

Yes, thank you. After I added --max-vocab-size 55296, the generate function no longer returns any error!

However, the output is just ":" repeated, like below.

(base) root@I166abc628200301838:/hy-tmp/flag# python chats.py
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::^CTraceback (most recent call last):

What should I do to solve this issue? Thanks!

IvoryTower800 commented 11 months ago

@MasterJH5574 Hi, I tried the original Chinese-alpaca-2-7b and Chinese-alpaca-2-7b-16k models.

This issue only exists for the 16k model; the 4k model's outputs are normal.

Are there any parameters I can set to solve this problem?

MasterJH5574 commented 11 months ago

Hi @IvoryTower800, sorry for the delayed response! I checked a bit. It looks like the issue is that the 16k model uses RoPE scaling with a factor of 4.0 (https://huggingface.co/hfl/chinese-alpaca-2-7b-16k/blob/main/config.json#L19-L22) while the 4k model does not (https://huggingface.co/hfl/chinese-alpaca-2-7b/blob/main/config.json).

We do not support RoPE scaling yet. I'll put this on the team's todo list, though it may take a while to add the support.
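For context, the 16k model's config sets rope_scaling to {"type": "linear", "factor": 4.0}, which means positions are divided by the factor before the rotary angles are computed. A minimal NumPy sketch of the idea (illustrative only, not how MLC LLM would implement it):

import numpy as np

def rope_angles(position, head_dim, base=10000.0, scaling_factor=1.0):
    # One inverse frequency per pair of dimensions, as in the RoPE formulation;
    # "linear" scaling simply divides the position index by the factor.
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    return (position / scaling_factor) * inv_freq

# With factor 4.0, position 16000 yields the same angles the base model saw
# at position 4000, stretching a 4k context window over 16k tokens.
assert np.allclose(rope_angles(16000, 128, scaling_factor=4.0),
                   rope_angles(4000, 128))

A build that ignores rope_scaling feeds the model rotary angles it never saw during fine-tuning, which is consistent with the degenerate ":" output above.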

IvoryTower800 commented 11 months ago

@MasterJH5574 Thank you for your reply. This project is amazing work. RoPE scaling is an important feature for many users. Personally speaking, if RoPE scaling were supported, mlc-llm would be infinitely close to perfect. I really hope it can be done soon. Thanks.

MasterJH5574 commented 11 months ago

Thank you @IvoryTower800, I totally agree with you that RoPE scaling is a very important feature. We have created a tracking issue for the progress: https://github.com/mlc-ai/mlc-llm/issues/1344.

MasterJH5574 commented 11 months ago

Going to close this issue as the original problem is addressed. We will follow up with improvements, including better error reporting regarding the maximum vocab size.

MrJungle1 commented 11 months ago

@MasterJH5574 Hello, I would like to ask about something. I also use chinese-alpaca. When I use mlc_llm.build to convert it,

python -m mlc_llm.build --model Desktop/mlc-llm-main/dist/models/x3 --target iphone --quantization q3f16_1 --max-seq-len 768

passing --quantization q0f16 cannot run on the iPhone because it requires 6.39 GB of memory for inference; q8f16_1 is normal and requires only 1.72 GB; q4f16_1 is abnormal and requires 4.38 GB; q3f16_1 is normal and requires only 893.5 MB.
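For a rough sanity check of those numbers: weight memory should scale roughly with bits per weight, so q4f16_1 would be expected to land between q8f16_1 and q3f16_1. A back-of-envelope sketch, assuming a 7B-parameter model and ignoring the KV cache and runtime overhead (actual figures depend heavily on the real parameter count and quantization details):

def weight_gib(n_params_billion: float, bits_per_weight: float) -> float:
    # Weights only: parameter count times bits per weight, converted to GiB.
    return n_params_billion * 1e9 * bits_per_weight / 8 / 2**30

for mode, bits in [("q0f16", 16), ("q8f16_1", 8), ("q4f16_1", 4), ("q3f16_1", 3)]:
    print(f"{mode}: ~{weight_gib(7.0, bits):.2f} GiB of weights")

By this ordering, a q4f16_1 footprint larger than q8f16_1's looks surprising, which matches the abnormal behavior described above.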