Closed: drdaliang closed this issue 2 weeks ago
Thanks for reporting the issue. Looking into it.
I've reproduced the issue on my end and identified two potential causes.
A fix is on the way; you should be able to get a working version in the next release (ETA: end of this week).
Just a quick update: a new Marlin kernel with desc_act support has landed. However, additional time is needed to thoroughly test scenarios where tp > 1 (tensor parallelism), and the work involved is more extensive than initially anticipated. The new ETA for the release is 08/21.
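For anyone following along, here is why desc_act needs its own kernel path: with act-order quantization the per-input-row group index `g_idx` is an arbitrary permutation instead of the contiguous `i // group_size` mapping, so the kernel has to gather scales and zero points per row. Below is a minimal, unfused reference sketch of GPTQ dequantization in PyTorch (not ScaleLLM's actual kernel; it assumes the common AutoGPTQ 4-bit packing and its zero-point-minus-one storage convention, which can vary between checkpoints):

```python
import torch

def dequantize_gptq(qweight, qzeros, scales, g_idx, bits=4):
    # qweight: [in_features // (32 // bits), out_features], int32-packed
    # qzeros:  [n_groups, out_features // (32 // bits)], int32-packed
    # scales:  [n_groups, out_features]
    # g_idx:   [in_features], group index of each input row; with
    #          desc_act=True this is a permutation, NOT i // group_size.
    pack = 32 // bits
    mask = (1 << bits) - 1
    shifts = torch.arange(pack, dtype=torch.int32, device=qweight.device) * bits

    # Unpack 4-bit weights along the input dimension.
    w = (qweight.unsqueeze(1) >> shifts.view(1, pack, 1)) & mask
    w = w.reshape(-1, qweight.shape[1])    # [in_features, out_features]

    # Unpack zero points along the output dimension.
    z = (qzeros.unsqueeze(2) >> shifts.view(1, 1, pack)) & mask
    z = z.reshape(qzeros.shape[0], -1)     # [n_groups, out_features]

    # The per-row gather through g_idx is what a fused kernel must handle;
    # "+ 1" follows the usual AutoGPTQ zero-point storage offset.
    g = g_idx.long()
    return (w.float() - (z[g] + 1).float()) * scales[g].float()
```

A fused kernel such as Marlin performs this gather inside the matmul itself, which is why desc_act support required new kernel work rather than a config change.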
Hello,
I am using ScaleLLM to run inference with the Mistral-Large-Instruct-2407-GPTQ model, and the output is all commas, like this:
You are a helpful assistant. You can help me by answering my questions. You can also ask me questions. word count: 19, token count: 29 hi word count: 1, token count: 8 , , , ,, word count: 0, token count: 15, tokens used: 46, model: mistral-large-latest(Mistral-Large-Instruct-2407-GPTQ)
I have successfully run the same local model with vLLM and SGLang. I got the model from this URL:
https://huggingface.co/TechxGenus/Mistral-Large-Instruct-2407-GPTQ
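For reproduction, the request that produces the comma output is equivalent to this minimal client call (the /v1/chat/completions path and port 8080 are taken from the server log below; the model name is just what my client sends and may need adjusting):

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "Mistral-Large-Instruct-2407-GPTQ",  # adjust to the served model name
        "messages": [{"role": "user", "content": "hi"}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])  # prints ", , , ,," here
```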
Logs from running the model:
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20240814 22:35:40.531607 162675 llm_handler.cpp:171] Creating engine with devices: cuda:0,cuda:1,cuda:2,cuda:3
W20240814 22:35:43.234417 162675 model_loader.cpp:301] Overwriting dtype from bfloat16 to float16 for quantization
I20240814 22:35:43.234916 162675 llm_engine.cpp:138] Initializing model from: /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ
I20240814 22:35:43.234936 162675 model_loader.cpp:172] Using fast tokenizer.
I20240814 22:35:43.281715 162675 llm_engine.cpp:156] Block info, block_size: 16, n_local_kv_heads: 2, head_dim: 128, n_layers: 88, dtype: Half
I20240814 22:35:43.283596 162675 llm_engine.cpp:175] Initializing model with ModelArgs: [model_type: mistral, dtype: float16, hidden_size: 12288, hidden_act: silu, intermediate_size: 28672, n_layers: 88, head_dim: 128, n_heads: 96, n_kv_heads: 8, vocab_size: 32768, rms_norm_eps: 1e-05, layer_norm_eps: 0, rotary_dim: 0, rope_theta: 1e+06, rope_scaling_rope_type: , rope_scaling_factor: 0, rope_scaling_low_freq_factor: 0, rope_scaling_high_freq_factor: 0, rope_scaling_original_max_position_embeddings: 0, rotary_pct: 1, max_position_embeddings: 32768, bos_token_id: 1, eos_token_id: 2, use_parallel_residual: 0, attn_qkv_clip: 0, attn_qk_ln: 0, attn_alibi: 0, alibi_bias_max: 0, no_bias: 0, linear_bias: 0, qkv_bias: 0, residual_post_layernorm: 0]
I20240814 22:35:43.283625 162675 llm_engine.cpp:176] Initializing model with quant args: QuantArgs: [quant_method: gptq, bits: 4, group_size: 128, desc_act: 1, true_sequential: 1]
I20240814 22:35:43.283633 162675 llm_engine.cpp:177] Initializing model with tokenizer args: TokenizerArgs: [tokenizer_type: sentencepiece, vocab_file: tokenizer.model, pattern: ]
I20240814 22:35:43.615725 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00001-of-00014.safetensors
I20240814 22:35:44.853797 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00002-of-00014.safetensors
I20240814 22:35:45.887975 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00003-of-00014.safetensors
I20240814 22:35:46.955483 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00004-of-00014.safetensors
I20240814 22:35:48.005149 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00005-of-00014.safetensors
I20240814 22:35:49.047144 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00006-of-00014.safetensors
I20240814 22:35:50.051784 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00007-of-00014.safetensors
I20240814 22:35:51.050997 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00008-of-00014.safetensors
I20240814 22:35:52.106201 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00009-of-00014.safetensors
I20240814 22:35:53.120800 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00010-of-00014.safetensors
I20240814 22:35:54.188920 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00011-of-00014.safetensors
I20240814 22:35:55.185894 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00012-of-00014.safetensors
I20240814 22:35:56.136348 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00013-of-00014.safetensors
I20240814 22:35:57.192777 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00014-of-00014.safetensors
I20240814 22:35:57.555701 162675 llm_engine.cpp:305] cuda:0: available memory: 5.12 GB, total memory: 23.55 GB
I20240814 22:35:57.555797 162675 llm_engine.cpp:309] Using max_memory_utilization: 0.9, max_cache_size: 0.00 B
I20240814 22:35:57.555825 162675 llm_engine.cpp:305] cuda:1: available memory: 5.12 GB, total memory: 23.55 GB
I20240814 22:35:57.555845 162675 llm_engine.cpp:309] Using max_memory_utilization: 0.9, max_cache_size: 0.00 B
I20240814 22:35:57.555866 162675 llm_engine.cpp:305] cuda:2: available memory: 5.12 GB, total memory: 23.55 GB
I20240814 22:35:57.555889 162675 llm_engine.cpp:309] Using max_memory_utilization: 0.9, max_cache_size: 0.00 B
I20240814 22:35:57.555910 162675 llm_engine.cpp:305] cuda:3: available memory: 5.12 GB, total memory: 23.55 GB
I20240814 22:35:57.555932 162675 llm_engine.cpp:309] Using max_memory_utilization: 0.9, max_cache_size: 0.00 B
I20240814 22:35:57.555955 162675 llm_engine.cpp:122] Initializing kv cache with size: 2.76 GB
I20240814 22:35:57.555976 162675 llm_engine.cpp:333] Initializing kv cache with shape: [2056 16 2 128]
I20240814 22:35:57.666657 162675 llm_engine.cpp:236] Capturing CUDA graphs: num_decoding_tokens: 1, batch sizes: 1 2 4 8 16 24 32 48 64
I20240814 22:36:00.271500 162675 llm_handler.cpp:224] Using default chat template for model type: mistral
INFO: Started server process [162675]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO: 172.24.21.213:53482 - "POST /v1/chat/completions HTTP/1.1" 200 OK
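In case it helps, the QuantArgs line in the log can be cross-checked against the checkpoint's quantize_config.json (assuming the usual AutoGPTQ file layout for this repo):

```python
import json
from pathlib import Path

model_dir = Path("/home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ")
cfg = json.loads((model_dir / "quantize_config.json").read_text())
# Expected to match the log: bits=4, group_size=128, desc_act=True, true_sequential=True
print({k: cfg.get(k) for k in ("bits", "group_size", "desc_act", "true_sequential")})
```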
What could be the problem?