vectorch-ai / ScaleLLM

A high-performance inference system for large language models, designed for production environments.
https://docs.vectorch.com/
Apache License 2.0

Mistral large GPTQ model inference problem #308

Closed drdaliang closed 2 weeks ago

drdaliang commented 2 months ago

Hello,

I am using ScaleLLM to run inference on the Mistral-Large-Instruct-2407-GPTQ model, and the output is all commas, like this:

You are a helpful assistant. You can help me by answering my questions. You can also ask me questions. word count: 19, token count: 29 hi word count: 1, token count: 8 , , , ,, word count: 0, token count: 15, tokens used: 46, model: mistral-large-latest(Mistral-Large-Instruct-2407-GPTQ)
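
For reference, here is a minimal sketch of a request that reproduces this. The endpoint and port are taken from the server log below; the `model` value is an assumption about the name the server registered:

```python
# Minimal reproduction sketch. Endpoint/port are from the server log;
# the "model" name passed here is an assumption.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "Mistral-Large-Instruct-2407-GPTQ",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "hi"},
        ],
    },
)
# Expected a greeting; instead the content comes back as ", , , ,,".
print(resp.json()["choices"][0]["message"]["content"])
```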

I have successfully run the same local model with vLLM and SGLang. I got the model from this URL:

https://huggingface.co/TechxGenus/Mistral-Large-Instruct-2407-GPTQ

logs when running the model:

```
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20240814 22:35:40.531607 162675 llm_handler.cpp:171] Creating engine with devices: cuda:0,cuda:1,cuda:2,cuda:3
W20240814 22:35:43.234417 162675 model_loader.cpp:301] Overwriting dtype from bfloat16 to float16 for quantization
I20240814 22:35:43.234916 162675 llm_engine.cpp:138] Initializing model from: /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ
I20240814 22:35:43.234936 162675 model_loader.cpp:172] Using fast tokenizer.
I20240814 22:35:43.281715 162675 llm_engine.cpp:156] Block info, block_size: 16, n_local_kv_heads: 2, head_dim: 128, n_layers: 88, dtype: Half
I20240814 22:35:43.283596 162675 llm_engine.cpp:175] Initializing model with ModelArgs: [model_type: mistral, dtype: float16, hidden_size: 12288, hidden_act: silu, intermediate_size: 28672, n_layers: 88, head_dim: 128, n_heads: 96, n_kv_heads: 8, vocab_size: 32768, rms_norm_eps: 1e-05, layer_norm_eps: 0, rotary_dim: 0, rope_theta: 1e+06, rope_scaling_rope_type: , rope_scaling_factor: 0, rope_scaling_low_freq_factor: 0, rope_scaling_high_freq_factor: 0, rope_scaling_original_max_position_embeddings: 0, rotary_pct: 1, max_position_embeddings: 32768, bos_token_id: 1, eos_token_id: 2, use_parallel_residual: 0, attn_qkv_clip: 0, attn_qk_ln: 0, attn_alibi: 0, alibi_bias_max: 0, no_bias: 0, linear_bias: 0, qkv_bias: 0, residual_post_layernorm: 0]
I20240814 22:35:43.283625 162675 llm_engine.cpp:176] Initializing model with quant args: QuantArgs: [quant_method: gptq, bits: 4, group_size: 128, desc_act: 1, true_sequential: 1]
I20240814 22:35:43.283633 162675 llm_engine.cpp:177] Initializing model with tokenizer args: TokenizerArgs: [tokenizer_type: sentencepiece, vocab_file: tokenizer.model, pattern: ]
I20240814 22:35:43.615725 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00001-of-00014.safetensors
I20240814 22:35:44.853797 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00002-of-00014.safetensors
I20240814 22:35:45.887975 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00003-of-00014.safetensors
I20240814 22:35:46.955483 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00004-of-00014.safetensors
I20240814 22:35:48.005149 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00005-of-00014.safetensors
I20240814 22:35:49.047144 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00006-of-00014.safetensors
I20240814 22:35:50.051784 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00007-of-00014.safetensors
I20240814 22:35:51.050997 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00008-of-00014.safetensors
I20240814 22:35:52.106201 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00009-of-00014.safetensors
I20240814 22:35:53.120800 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00010-of-00014.safetensors
I20240814 22:35:54.188920 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00011-of-00014.safetensors
I20240814 22:35:55.185894 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00012-of-00014.safetensors
I20240814 22:35:56.136348 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00013-of-00014.safetensors
I20240814 22:35:57.192777 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00014-of-00014.safetensors
I20240814 22:35:57.555701 162675 llm_engine.cpp:305] cuda:0: available memory: 5.12 GB, total memory: 23.55 GB
I20240814 22:35:57.555797 162675 llm_engine.cpp:309] Using max_memory_utilization: 0.9, max_cache_size: 0.00 B
I20240814 22:35:57.555825 162675 llm_engine.cpp:305] cuda:1: available memory: 5.12 GB, total memory: 23.55 GB
I20240814 22:35:57.555845 162675 llm_engine.cpp:309] Using max_memory_utilization: 0.9, max_cache_size: 0.00 B
I20240814 22:35:57.555866 162675 llm_engine.cpp:305] cuda:2: available memory: 5.12 GB, total memory: 23.55 GB
I20240814 22:35:57.555889 162675 llm_engine.cpp:309] Using max_memory_utilization: 0.9, max_cache_size: 0.00 B
I20240814 22:35:57.555910 162675 llm_engine.cpp:305] cuda:3: available memory: 5.12 GB, total memory: 23.55 GB
I20240814 22:35:57.555932 162675 llm_engine.cpp:309] Using max_memory_utilization: 0.9, max_cache_size: 0.00 B
I20240814 22:35:57.555955 162675 llm_engine.cpp:122] Initializing kv cache with size: 2.76 GB
I20240814 22:35:57.555976 162675 llm_engine.cpp:333] Initializing kv cache with shape: [2056 16 2 128]
I20240814 22:35:57.666657 162675 llm_engine.cpp:236] Capturing CUDA graphs: num_decoding_tokens: 1, batch sizes: 1 2 4 8 16 24 32 48 64
I20240814 22:36:00.271500 162675 llm_handler.cpp:224] Using default chat template for model type: mistral
INFO:     Started server process [162675]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO:     172.24.21.213:53482 - "POST /v1/chat/completions HTTP/1.1" 200 OK
```
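
Incidentally, the logged kv cache size checks out against the logged shape, so the cache sizing itself looks correct:

```python
# Sanity check of the logged 2.76 GB kv cache size from the logged shape
# [num_blocks, block_size, n_local_kv_heads, head_dim] = [2056, 16, 2, 128],
# with dtype Half (2 bytes), 88 layers, and one K plus one V cache per layer.
num_blocks, block_size, n_kv_heads, head_dim = 2056, 16, 2, 128
bytes_per_elem, n_layers, k_and_v = 2, 88, 2

total = num_blocks * block_size * n_kv_heads * head_dim * bytes_per_elem * n_layers * k_and_v
print(f"{total / 2**30:.2f} GB")  # -> 2.76 GB, matching the log
```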

What could be the problem?

guocuimi commented 2 months ago

Thanks for reporting the issue. Looking into it.

guocuimi commented 2 months ago

I have reproduced the issue on my end and identified two potential causes.

A fix is on the way; you should be able to get a working version in the next release (ETA: end of this week).

guocuimi commented 2 months ago

Just a quick update: a new Marlin kernel with desc_act support has landed. However, we need additional time to thoroughly test scenarios with tensor parallelism (tp > 1); the work involved is more extensive than initially anticipated. The new ETA for the release is 08/21.
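
For readers unfamiliar with desc_act: with act-order enabled, a GPTQ checkpoint stores a per-channel group index (g_idx), so the kernel must gather scales and zeros per input channel rather than per contiguous block. A rough NumPy illustration of the difference (not the actual Marlin/ScaleLLM code; the toy shapes and the unpacked-weight layout are simplifications):

```python
import numpy as np

# Toy shapes for illustration only (real checkpoints pack 4-bit weights).
in_features, out_features, group_size, bits = 256, 8, 128, 4
n_groups = in_features // group_size
rng = np.random.default_rng(0)

qweight = rng.integers(0, 2**bits, size=(in_features, out_features))  # unpacked 4-bit values
scales = rng.random((n_groups, out_features)).astype(np.float32)
zeros = np.full((n_groups, out_features), 2 ** (bits - 1))

# desc_act=False: channel i belongs to group i // group_size.
g_idx = np.arange(in_features) // group_size
# desc_act=True: channels were quantized in activation order, so the
# checkpoint ships an arbitrary per-channel group mapping instead.
g_idx_act = rng.permutation(g_idx)

def dequantize(qw, g):
    # Gather per-channel scales/zeros. A kernel that assumes the contiguous
    # mapping while the checkpoint uses act-order dequantizes with the wrong
    # scales, which can produce degenerate output like the commas above.
    return (qw - zeros[g]) * scales[g]

assert not np.allclose(dequantize(qweight, g_idx), dequantize(qweight, g_idx_act))
```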