mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] chatglm4 mlc_llm shows error "TVMError: Check failed: append_length > 0 (0 vs. 0) : Append with length 0 is not allowed." during mlc_llm chat CLI #2517

Open lihaofd opened 4 months ago

lihaofd commented 4 months ago

Environment:
mlc-ai-nightly-cu122 0.15.dev404
mlc-llm-nightly-cu122 0.1.dev1355
transformers 4.41.2

git clone https://huggingface.co/THUDM/glm-4-9b-chat

mlc_llm convert_weight ./dist/models/glm-4-9b-chat/ --quantization q4f16_1 -o dist/glm-4-9b-chat-MLC

mlc_llm gen_config ./dist/models/glm-4-9b-chat/ --quantization q4f16_1 --conv-template glm -o dist/glm-4-9b-chat-MLC/

It shows:

The repository for dist/models/glm-4-9b-chat contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/dist/models/glm-4-9b-chat. You can avoid this prompt in future by passing the argument trust_remote_code=True.
Do you wish to run the custom code? [y/N] y
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

After adding trust_remote_code=True:

fast_tokenizer = AutoTokenizer.from_pretrained(str(config.parent), use_fast=True, trust_remote_code=True)

it shows the error:

AttributeError: 'ChatGLM4Tokenizer' object has no attribute 'backend_tokenizer'
/workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:154: Warning: Tokenizer info is not detected as tokenizer.json is not found. The default tokenizer info will be used.
Segmentation fault (core dumped)

The segmentation fault happens at:

mlc_chat_config.tokenizer_info = asdict(Tokenizer.detect_tokenizer_info(str(output)))
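
The AttributeError above is consistent with glm-4-9b-chat shipping only a slow, sentencepiece-based tokenizer (no tokenizer.json / Rust backend), which is what the detection step expects. A minimal sketch to check this locally with transformers, assuming the model was cloned to ./dist/models/glm-4-9b-chat/ as above:

```python
# Sketch: check whether the GLM tokenizer exposes a fast (Rust) backend.
# If it does not, there is no backend_tokenizer attribute and no tokenizer.json,
# which matches the AttributeError and the tokenizer warnings above.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "./dist/models/glm-4-9b-chat/",
    use_fast=True,
    trust_remote_code=True,  # GLM ships custom tokenizer code
)

print(type(tok).__name__)            # e.g. ChatGLM4Tokenizer
print("is_fast:", tok.is_fast)       # expected False for a slow tokenizer
print("has backend_tokenizer:", hasattr(tok, "backend_tokenizer"))
```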

Hzfengsy commented 4 months ago

I'm not sure but GLM may use a customized tokenizer which is not supported yet

lihaofd commented 4 months ago

> I'm not sure but GLM may use a customized tokenizer which is not supported yet

https://github.com/mlc-ai/mlc-llm/pull/1313 mentioned bringing chatglm3 back, but I tried chatglm3-6b and it shows the same error.

Ubospica commented 3 months ago

That is related to a recent change to the tokenizer in #2416. We will fix it soon.

Ubospica commented 3 months ago

See #2532

lihaofd commented 3 months ago

@Ubospica thanks! I just tested the latest packages: mlc-ai-nightly-cu122 0.15.dev404, mlc-llm-nightly-cu122 0.1.dev1382.

mlc_llm gen_config ./dist/models/glm-4-9b-chat/ --quantization q4f16_1 --conv-template glm -o dist/glm-4-9b-chat-MLC/

works now. But after compiling with

mlc_llm compile ./dist/glm-4-9b-chat-MLC/mlc-chat-config.json --device cuda --quantization q4f16_1 --model-type chatglm --output ./dist/libs/glm-4-9b-chat/glm-4-9b-chat-cuda.so

and running with the CLI

mlc_llm chat ./dist/glm-4-9b-chat-MLC/ --device "cuda" --model-lib dist/libs/glm-4-9b-chat/glm-4-9b-chat-cuda.so

it shows an error like the one below:

mlc_llm chat ./dist/glm-4-9b-chat-MLC/ --device "cuda" --model-lib dist/libs/glm-4-9b-chat/glm-4-9b-chat-cuda.so
[2024-06-08 07:04:34] INFO auto_device.py:79: Found device: cuda:0
[2024-06-08 07:04:34] INFO engine_base.py:143: Using library model: dist/libs/glm-4-9b-chat/glm-4-9b-chat-cuda.so
[07:04:34] /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:154: Warning: Tokenizer info is not detected as tokenizer.json is not found. The default tokenizer info will be used.
[07:04:34] /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:130: Warning: Using tokenizer.model since we cannot locate tokenizer.json. It is recommended to use tokenizer.json to ensure all token mappings are included, since currently, files like added_tokens.json, tokenizer_config.json are ignored. Consider converting tokenizer.model to tokenizer.json by compiling the model with MLC again, or see if MLC's huggingface provides this file.
[07:04:34] /workspace/mlc-llm/cpp/serve/config.cc:649: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 2048.
[07:04:34] /workspace/mlc-llm/cpp/serve/config.cc:649: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 131072, prefill chunk size will be set to 2048.
[07:04:34] /workspace/mlc-llm/cpp/serve/config.cc:649: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 314995, prefill chunk size will be set to 2048.
[07:04:34] /workspace/mlc-llm/cpp/serve/config.cc:729: The actual engine mode is "interactive". So max batch size is 1, max KV cache token capacity is 131072, prefill chunk size is 2048.
[07:04:34] /workspace/mlc-llm/cpp/serve/config.cc:734: Estimated total single GPU memory usage: 13215.253 MB (Parameters: 5043.234 MB. KVCache: 5202.672 MB. Temporary buffer: 2969.346 MB). The actual usage might be slightly larger than the estimated number.
[07:04:36] /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:154: Warning: Tokenizer info is not detected as tokenizer.json is not found. The default tokenizer info will be used.
[07:04:36] /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:130: Warning: Using tokenizer.model since we cannot locate tokenizer.json. It is recommended to use tokenizer.json to ensure all token mappings are included, since currently, files like added_tokens.json, tokenizer_config.json are ignored. Consider converting tokenizer.model to tokenizer.json by compiling the model with MLC again, or see if MLC's huggingface provides this file.
sentencepiece_processor.cc(922) LOG(ERROR) 3rdparty/tokenizers-cpp/sentencepiece/src/sentencepieceprocessor.cc(289) [model] Model is not initialized. Returns default value 0
[07:04:36] /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:154: Warning: Tokenizer info is not detected as tokenizer.json is not found. The default tokenizer info will be used.
[07:04:36] /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:130: Warning: Using tokenizer.model since we cannot locate tokenizer.json. It is recommended to use tokenizer.json to ensure all token mappings are included, since currently, files like added_tokens.json, tokenizer_config.json are ignored. Consider converting tokenizer.model to tokenizer.json by compiling the model with MLC again, or see if MLC's huggingface provides this file.
You can use the following special commands:
  /help               print the special commands
  /exit               quit the cli
  /stats              print out stats of last request (token/sec)
  /metrics            print out full engine metrics
  /reset              restart a fresh chat
  /set [overrides]    override settings in the generation config. For example,
                      /set temperature=0.5;top_p=0.8;seed=23;max_tokens=100;stop=str1,str2
                      Note: Separate stop words in the stop option with commas (,).
  Multi-line input: Use escape+enter to start a new line.

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/haoli/anaconda3.11-GPU_new_mlc2/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/home/haoli/anaconda3.11-GPU_new_mlc2/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/haoli/anaconda3.11-GPU_new_mlc2/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "/workspace/mlc-llm/cpp/serve/threaded_engine.cc", line 182, in mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
  File "/workspace/mlc-llm/cpp/serve/engine.cc", line 619, in mlc::llm::serve::EngineImpl::Step()
  File "/workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc", line 116, in mlc::llm::serve::NewRequestPrefillActionObj::Step(mlc::llm::serve::EngineState)
  File "/workspace/mlc-llm/cpp/serve/model.cc", line 232, in mlc::llm::serve::ModelImpl::BatchPrefill(tvm::runtime::ObjectRef const&, std::vector<long, std::allocator<long> > const&, std::vector<int, std::allocator<int> > const&)
tvm._ffi.base.TVMError: Traceback (most recent call last):
  7: mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
        at /workspace/mlc-llm/cpp/serve/threaded_engine.cc:182
  6: mlc::llm::serve::EngineImpl::Step()
        at /workspace/mlc-llm/cpp/serve/engine.cc:619
  5: mlc::llm::serve::NewRequestPrefillActionObj::Step(mlc::llm::serve::EngineState)
        at /workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc:116
  4: mlc::llm::serve::ModelImpl::BatchPrefill(tvm::runtime::ObjectRef const&, std::vector<long, std::allocator<long> > const&, std::vector<int, std::allocator<int> > const&)
        at /workspace/mlc-llm/cpp/serve/model.cc:232
  3: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::__mk_TVM8::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#1}> >::Call(tvm::runtime::PackedFuncObj const, tvm::runtime::relax_vm::__mk_TVM8, tvm::runtime::TVMRetValue)
  2: tvm::runtime::relax_vm::PagedAttentionKVCacheObj::BeginForward(tvm::runtime::ShapeTuple const&, tvm::runtime::ShapeTuple const&, tvm::runtime::Optional const&)
  1: tvm::runtime::relax_vm::PagedAttentionKVCacheObj::ReserveAppendLengthInSeq(tvm::runtime::relax_vm::Sequence, long)
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/runtime/relax_vm/paged_kv_cache.cc", line 2004
TVMError: Check failed: append_length > 0 (0 vs. 0) : Append with length 0 is not allowed.
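
The failed check means the engine tried to reserve space for a prefill of zero tokens, i.e. the tokenized prompt came out empty, which would fit the "Model is not initialized" sentencepiece error earlier in the log. A rough way to probe this hypothesis outside the engine is to tokenize the same prompt with the Hugging Face tokenizer (note this exercises transformers rather than MLC's C++ tokenizer path, so it is only a partial check):

```python
# Sketch: confirm the prompt itself tokenizes to a non-empty sequence.
# If MLC's C++ tokenizer produced an empty encoding instead, prefill would hit
# "Check failed: append_length > 0 (0 vs. 0)" exactly as in the traceback above.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "./dist/models/glm-4-9b-chat/",  # original HF checkout cloned earlier
    trust_remote_code=True,
)

ids = tok.encode("please introduce shanghai")
print("token count:", len(ids))
assert len(ids) > 0, "an empty encoding would reproduce the append_length failure"
```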

MasterJH5574 commented 3 months ago

@lihaofd thanks for reporting. We'll look into the issue on GLM 4. Meanwhile, would you mind confirming that this issue does not happen for other models like Llama?

lihaofd commented 3 months ago

@MasterJH5574 I have tried chatglm3-6b. It does not show the error "TVMError: Check failed: append_length > 0 (0 vs. 0) : Append with length 0 is not allowed.", but the output is abnormal, e.g. with

mlc_llm chat ./dist/chatglm3-6b-MLC/ --device "cuda" --model-lib dist/libs/chatglm3-6b/chatglm3-6b-cuda.so

please introduce shanghai

My name is xxx, and I am a school school

  1. I- q is a 在校学生 .进行的进行的

I am a language model,

And here's largest

No 'big' what The speech speech

懈口令公 quo I'm wrong

您傘

您遮 Aut兼任

quality

语言
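
One way to isolate whether the garbled text above comes from the MLC conversion (weights or tokenizer) rather than from the checkpoint itself is to run the same prompt through plain transformers. A sketch, assuming the usage pattern from the chatglm3-6b model card and a hypothetical local checkout at ./dist/models/chatglm3-6b/ (needs enough GPU memory for fp16):

```python
# Sketch: sanity-check the original chatglm3-6b checkpoint with transformers.
# Coherent output here would point at the MLC conversion/tokenizer path
# rather than at the weights themselves.
from transformers import AutoModel, AutoTokenizer

path = "./dist/models/chatglm3-6b/"  # hypothetical local HF checkout
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(path, trust_remote_code=True).half().cuda().eval()

# chatglm3's custom modeling code exposes a chat() helper
response, history = model.chat(tokenizer, "please introduce shanghai", history=[])
print(response)
```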

lihaofd commented 3 months ago

@MasterJH5574 I also tried https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat. The CLI below works with normal output:

mlc_llm chat ./dist/Llama3-8B-Chinese-Chat-MLC/ --device "cuda" --model-lib ./dist/libs/Llama3-8B-Chinese-Chat/Llama3-8B-Chinese-Chat-q4f16_1-cuda.so

introduce shanghai

Shanghai, the "Pearl of the Orient," is an iconic metropolis that effortlessly blends traditional Chinese culture and history with its modern, cosmopolitan flair. Located on the eastern coast of China, this vibrant city is one of the most populous and culturally significant urban centers in the world.

As part of the Yangtze River Delta, Shanghai has been an essential trading point for centuries, eventually emerging as a major political and economic hub in the People's Republic of China. It's home to a magnitude of historical, cultural, and architectural marvels that span across eras and styles – the modern skyscrapers of the stunning skyline aptly complemented by tranquil Chinese pavilions and gardens.

From world-renowned attractions like the Shanghai Tower and the iconic Bund, to the restored classical charm of the old French Concession and the beckoning call of sweet juicy sheng jian bing (crispy-skinned pancake) and a steaming pot of xiaolongbao (soup-filled dumplings), this city is a must-visit destination for anyone eager to explore the unique dynamism of contemporary China. With its people, food, and riveting history, Shanghai is sure to meet and, likely, exceed the expectations of curious and keen travelers.

MasterJH5574 commented 3 months ago

@lihaofd Thanks for sharing so much information. We'll look into this.

felixslu commented 1 month ago

internlm2 runs into this error as well, with:

v0.9.dev0/mlc_ai_nightly_cu122-0.15.dev519-cp310-cp310-manylinux_2_28_x86_64.whl