Closed: SinanAkkoyun closed this issue 9 months ago.
Same error here. I'm running two 2080 Ti 22 GB cards; I tried python==3.10 and python==3.8, and also compiled with python==3.8 and CUDA toolkit 12.1, all under WSL2. Same problem in every case.
(llm) root@DESKTOP-1CSPSTT:~/vllm-main# python -m vllm.entrypoints.api_server --model /mnt/e/Code/text-generation-webui/models/orca-2-13B-AWQ --trust-remote-code --quantization awq
WARNING 12-04 08:31:50 config.py:140] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 12-04 08:31:50 llm_engine.py:73] Initializing an LLM engine with config: model='/mnt/e/Code/text-generation-webui/models/orca-2-13B-AWQ', tokenizer='/mnt/e/Code/text-generation-webui/models/orca-2-13B-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 12-04 08:36:01 llm_engine.py:218] # GPU blocks: 898, # CPU blocks: 327
Traceback (most recent call last):
  File "/root/miniconda3/envs/llm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/llm/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/vllm-main/vllm/entrypoints/api_server.py", line 80, in <module>
@HelloCard
llm = LLM(model="TheBloke/Mistral-7B-OpenOrca-AWQ", quantization="AWQ", trust_remote_code=True, dtype="half", max_model_len=16384)
This did it for me: setting max_model_len! Tell me if it works for you too and I'll close the issue.
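For the API server, the equivalent engine flag is --max-model-len. My hedged guess at what's going on: Mistral's config advertises a 32k context, and vLLM refuses to start when the KV cache can't hold the model's full max sequence length, so capping it makes the cache fit. Something like this (substitute your own model path):

python -m vllm.entrypoints.api_server --model TheBloke/Mistral-7B-OpenOrca-AWQ --quantization awq --dtype half --trust-remote-code --max-model-len 16384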
@SinanAkkoyun thank you, god bless you! It solved my problem.
I ran into the same issue, but setting max_model_len doesn't work for me.
Model: TheBloke/Mistral-7B-OpenOrca-AWQ (and any other Mistral AWQ model of theirs)
CUDA: 12.2
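In case it helps: when lowering max_model_len alone isn't enough, the error vLLM raises in this situation also suggests raising gpu_memory_utilization (it defaults to 0.9), which gives the KV cache a larger share of VRAM. A minimal sketch combining both knobs; the exact numbers are guesses to adapt to your setup:

from vllm import LLM

# Guessed values: a smaller context window plus a slightly larger share of
# VRAM for vLLM (gpu_memory_utilization defaults to 0.9).
llm = LLM(
    model="TheBloke/Mistral-7B-OpenOrca-AWQ",
    quantization="awq",
    dtype="half",
    trust_remote_code=True,
    max_model_len=8192,
    gpu_memory_utilization=0.95,
)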