mistralai / mistral-inference

Official inference library for Mistral models
https://mistral.ai/
Apache License 2.0

Unable to load onto GPU with 24 GB VRAM with quantization #72

Open fangzhouli opened 11 months ago

fangzhouli commented 11 months ago

Hi, thank you for the amazing model! Super excited to test it out!

I am trying to load it onto my GeForce RTX 3090 (24 GB VRAM), which I believe should be more than enough for inference with 8-bit or 4-bit quantization. (I have tested this with LLaMa 2-7B, and it worked.)

However, the process is always killed before the checkpoint shards finish loading. I am wondering if anyone has encountered a similar situation.

Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in 8-bit or 4-bit. Pass your own torch_dtype to specify the dtype of the remaining non-linear layers or pass torch_dtype=torch.float16 to remove this warning.
loading weights file pytorch_model.bin from cache at /home/fzli/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/pytorch_model.bin.index.json
Instantiating MistralForCausalLM model under default dtype torch.float16.
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}

Detected 4-bit loading: activating 4-bit loading for this model
Loading checkpoint shards:   0%|                                                                                                                                     | 0/2 [00:00<?, ?it/s]
Killed
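
For context, the load goes through the standard transformers + bitsandbytes 4-bit path, roughly like the sketch below (a minimal reconstruction rather than my exact script; the `device_map="auto"` and `low_cpu_mem_usage=True` flags are assumptions, but they are the usual way to keep the CPU-side copy of the weights small while the shards stream to the GPU):

```python
# Minimal reconstruction of the 4-bit load (flags are assumptions, not the exact script).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",        # stream shards straight onto the GPU
    low_cpu_mem_usage=True,   # avoid materializing a full copy in system RAM first
)
```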

Background info:

delta-whiplash commented 10 months ago

Hello, same issue here with:

Tue Dec 12 15:55:25 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID P40-24Q                   On  | 00000000:01:00.0 Off |                  N/A |
| N/A   N/A    P8              N/A /  N/A |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I get this error with this CLI command: python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model mistralai/Mixtral-8X7B-Instruct-v0.1 --tensor-parallel-size 1 --dtype half --load-format pt

INFO 12-12 15:56:02 api_server.py:719] args: Namespace(host='0.0.0.0', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, chat_template=None, response_role='assistant', model='mistralai/Mixtral-8X7B-Instruct-v0.1', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='pt', dtype='half', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 12-12 15:56:02 config.py:447] Casting torch.bfloat16 to torch.float16.
INFO 12-12 15:56:02 llm_engine.py:73] Initializing an LLM engine with config: model='mistralai/Mixtral-8X7B-Instruct-v0.1', tokenizer='mistralai/Mixtral-8X7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=pt, tensor_parallel_size=1, quantization=None, seed=0)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/delta/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 729, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/delta/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 495, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/home/delta/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/delta/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 314, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/delta/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 109, in __init__
    self._init_workers(distributed_init_method)
  File "/home/delta/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 145, in _init_workers
    self._run_workers(
  File "/home/delta/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 750, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/home/delta/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 724, in _run_workers_in_batch
    output = executor(*args, **kwargs)
  File "/home/delta/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 72, in load_model
    self.model_runner.load_model()
  File "/home/delta/.local/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 36, in load_model
    self.model = get_model(self.model_config)
  File "/home/delta/.local/lib/python3.10/site-packages/vllm/model_executor/model_loader.py", line 117, in get_model
    model = model_class(model_config.hf_config, linear_method)
  File "/home/delta/.local/lib/python3.10/site-packages/vllm/model_executor/models/mixtral.py", line 465, in __init__
    self.layers = nn.ModuleList([
  File "/home/delta/.local/lib/python3.10/site-packages/vllm/model_executor/models/mixtral.py", line 466, in <listcomp>
    MixtralDecoderLayer(config)
  File "/home/delta/.local/lib/python3.10/site-packages/vllm/model_executor/models/mixtral.py", line 413, in __init__
    self.block_sparse_moe = BlockSparseMoE(
  File "/home/delta/.local/lib/python3.10/site-packages/vllm/model_executor/models/mixtral.py", line 193, in __init__
    torch.empty(self.ffn_dim_per_partition * self.num_experts,
  File "/home/delta/.local/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacity of 24.00 GiB of which 837.85 MiB is free. Including non-PyTorch memory, this process has 21.54 GiB memory in use. Of the allocated memory 21.25 GiB is allocated by PyTorch, and 13.38 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
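
A quick back-of-the-envelope estimate (my own numbers, assuming ~46.7B total parameters for Mixtral-8x7B) shows the fp16 weights alone are far larger than a single 24 GB card, so `--dtype half` without quantization or multi-GPU tensor parallelism cannot fit:

```python
# Back-of-the-envelope weight-memory estimate for Mixtral-8x7B in fp16.
# Parameter count is an approximation; KV cache and activations are not included.
params = 46.7e9         # ~46.7 billion total parameters (all experts stay resident)
bytes_per_param = 2     # fp16 / --dtype half
weights_gib = params * bytes_per_param / 2**30
print(f"~{weights_gib:.0f} GiB of weights")   # ≈ 87 GiB, vs. 24 GiB of VRAM
```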

Hope this helps. I really appreciate the amazing work you are doing.