Open fangzhouli opened 11 months ago
Hello, I have the same issue here, with:
Tue Dec 12 15:55:25 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 GRID P40-24Q On | 00000000:01:00.0 Off | N/A |
| N/A N/A P8 N/A / N/A | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
I get this error with the following command:
python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model mistralai/Mixtral-8X7B-Instruct-v0.1 --tensor-parallel-size 1 --dtype half --load-format pt
INFO 12-12 15:56:02 api_server.py:719] args: Namespace(host='0.0.0.0', port=8000, allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], served_model_name=None, chat_template=None, response_role='assistant', model='mistralai/Mixtral-8X7B-Instruct-v0.1', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='pt', dtype='half', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 12-12 15:56:02 config.py:447] Casting torch.bfloat16 to torch.float16.
INFO 12-12 15:56:02 llm_engine.py:73] Initializing an LLM engine with config: model='mistralai/Mixtral-8X7B-Instruct-v0.1', tokenizer='mistralai/Mixtral-8X7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=pt, tensor_parallel_size=1, quantization=None, seed=0)
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/delta/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 729, in
Hope this helps; I really appreciate your amazing work.
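(Not part of the original report, but for context: once the `api_server` launched above actually comes up, the OpenAI-compatible endpoint it exposes would be queried roughly as sketched below. The host and port follow the command above; the prompt, `api_key`, and `max_tokens` values are placeholder assumptions.)

```python
# Minimal sketch, assuming the server above starts on the default port 8000.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # vLLM does not validate the key by default
)

completion = client.chat.completions.create(
    model="mistralai/Mixtral-8X7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Say hello."}],  # placeholder prompt
    max_tokens=64,
)
print(completion.choices[0].message.content)
```

The `--host 0.0.0.0` flag in the launch command just binds the server to all interfaces so that this endpoint is reachable from other machines.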
Hi, thank you for the amazing model! Super excited to test it out!
I am trying to load it onto my GeForce RTX 3090 (24 GB of VRAM), which I believe should be more than enough for inference with 8-bit or 4-bit quantization. (I have tested this on LLaMA 2 7B, and it worked.)
However, the process is always killed before the checkpoint shards finish loading. I am wondering if anyone has encountered a similar situation.
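(For context only, not something the reporter posted: a minimal sketch of what an 8/4-bit quantized load via transformers + bitsandbytes could look like. The `BitsAndBytesConfig` settings, prompt, and generation parameters are illustrative assumptions, and the model id casing follows the issue text.)

```python
# Minimal sketch of a 4-bit load with transformers + bitsandbytes; all
# quantization settings here are assumptions, not what the reporter ran.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8X7B-Instruct-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place layers on the GPU (and spill to CPU if needed)
)

prompt = "[INST] Hello, who are you? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Note that Mixtral has roughly 47B total parameters, so even a 4-bit load sits close to 24 GB before activations and cache are accounted for.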
Background info: