vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

OutOfMemoryError Llama2-70b offline_infer #636

Closed yinochaos closed 8 months ago

yinochaos commented 1 year ago

Code and environment

nvidia-smi on the node (8x NVIDIA A800 80 GB GPUs):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800-SXM...  On   | 00000000:65:01.0 Off |                    0 |
| N/A   30C    P0    62W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM...  On   | 00000000:65:02.0 Off |                    0 |
| N/A   31C    P0    61W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A800-SXM...  On   | 00000000:67:01.0 Off |                    0 |
| N/A   31C    P0    62W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A800-SXM...  On   | 00000000:67:02.0 Off |                    0 |
| N/A   30C    P0    61W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A800-SXM...  On   | 00000000:69:01.0 Off |                    0 |
| N/A   29C    P0    62W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A800-SXM...  On   | 00000000:69:02.0 Off |                    0 |
| N/A   30C    P0    61W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A800-SXM...  On   | 00000000:6B:01.0 Off |                    0 |
| N/A   30C    P0    61W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A800-SXM...  On   | 00000000:6B:02.0 Off |                    0 |
| N/A   29C    P0    57W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

llama_infer.py

from vllm import LLM, SamplingParams

# Checkpoint path taken from the engine log below (the snippet originally left model_path undefined).
model_path = '/baichuan/mule/models/Llama-2-70b-chat-hf'

# Create a sampling params object.
sampling_params = SamplingParams(n=1, temperature=0.3, top_p=0.85, top_k=5,
                                 max_tokens=2048, frequency_penalty=1.1)
# Create an LLM (tensor_parallel_size defaults to 1).
llm = LLM(model=model_path, dtype='bfloat16')

# Generate texts from the prompts in batches of 500. The output is a list of
# RequestOutput objects that contain the prompt, generated text, and other information.
prompts = []
prefix_info = []
with open('all_toxic.response', 'w') as f:
    for line in open('all_toxic.prompts'):
        items = line.strip('\n').split('\t')
        prefix_info.append(items)
        prompts.append('<reserved_102>' + items[-1] + '<reserved_103>')
        if len(prompts) == 500:
            outputs = llm.generate(prompts, sampling_params)
            # Write each output next to the input fields it came from.
            for output, fields in zip(outputs, prefix_info):
                for out in output.outputs:
                    generated_text = out.text.replace('\n', '\\n').replace('\t', ' ')
                    f.write('\t'.join(fields + [generated_text]) + '\n')
            prompts = []
            prefix_info = []
    # Flush the remaining prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output, fields in zip(outputs, prefix_info):
        for out in output.outputs:
            generated_text = out.text.replace('\n', '\\n').replace('\t', ' ')
            f.write('\t'.join(fields + [generated_text]) + '\n')

Error output from running python llama_infer.py:

INFO 08-01 08:13:39 llm_engine.py:68] Initializing an LLM engine with config: model='/baichuan/mule/models/Llama-2-70b-chat-hf', tokenizer='/baichuan/mule/models/Llama-2-70b-chat-hf', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.bfloat16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
model config <vllm.config.ModelConfig object at 0x7fe31339fdc0>
INFO 08-01 08:13:39 tokenizer.py:29] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
nchabrindqa5nbrdjvbtg:9490:9490 [0] NCCL INFO Bootstrap : Using eth0:192.18.87.126<0>
nchabrindqa5nbrdjvbtg:9490:9490 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
nchabrindqa5nbrdjvbtg:9490:9490 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
nchabrindqa5nbrdjvbtg:9490:9490 [0] NCCL INFO cudaDriverVersion 11070
NCCL version 2.14.3+cuda11.7
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Plugin Path : /usr/local/nccl-rdma-sharp-plugins/lib/libnccl-net.so
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO P2P plugin IBext
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [RO]; OOB eth0:192.18.87.126<0>
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Using network IBext
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Setting affinity for GPU 0 to ffffff,ffffffff
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 00/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 01/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 02/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 03/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 04/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 05/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 06/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 07/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 08/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 09/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 10/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 11/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 12/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 13/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 14/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 15/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 16/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 17/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 18/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 19/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 20/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 21/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 22/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 23/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 24/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 25/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 26/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 27/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 28/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 29/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 30/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Channel 31/32 :    0
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Connected all rings
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO Connected all trees
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
nchabrindqa5nbrdjvbtg:9490:9641 [0] NCCL INFO comm 0x2ae7b450 rank 0 nranks 1 cudaDev 0 busId 65010 - Init COMPLETE
architectures ['LlamaForCausalLM']
Traceback (most recent call last):
  File "/baichuan/mule/venv/vllm_env/vllm_infer/llama_infer.py", line 22, in <module>
    llm = LLM(model=model_path, dtype='bfloat16')
  File "/baichuan/mule/venv/vllm_env/lib/python3.9/site-packages/vllm-0.1.2-py3.9-linux-x86_64.egg/vllm/entrypoints/llm.py", line 66, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/baichuan/mule/venv/vllm_env/lib/python3.9/site-packages/vllm-0.1.2-py3.9-linux-x86_64.egg/vllm/engine/llm_engine.py", line 212, in from_engine_args
    engine = cls(*engine_configs,
  File "/baichuan/mule/venv/vllm_env/lib/python3.9/site-packages/vllm-0.1.2-py3.9-linux-x86_64.egg/vllm/engine/llm_engine.py", line 103, in __init__
    self._init_workers(distributed_init_method)
  File "/baichuan/mule/venv/vllm_env/lib/python3.9/site-packages/vllm-0.1.2-py3.9-linux-x86_64.egg/vllm/engine/llm_engine.py", line 128, in _init_workers
    self._run_workers(
  File "/baichuan/mule/venv/vllm_env/lib/python3.9/site-packages/vllm-0.1.2-py3.9-linux-x86_64.egg/vllm/engine/llm_engine.py", line 389, in _run_workers
    output = executor(*args, **kwargs)
  File "/baichuan/mule/venv/vllm_env/lib/python3.9/site-packages/vllm-0.1.2-py3.9-linux-x86_64.egg/vllm/worker/worker.py", line 67, in init_model
    self.model = get_model(self.model_config)
  File "/baichuan/mule/venv/vllm_env/lib/python3.9/site-packages/vllm-0.1.2-py3.9-linux-x86_64.egg/vllm/model_executor/model_loader.py", line 44, in get_model
    model = model_class(model_config.hf_config)
  File "/baichuan/mule/venv/vllm_env/lib/python3.9/site-packages/vllm-0.1.2-py3.9-linux-x86_64.egg/vllm/model_executor/models/llama.py", line 236, in __init__
    self.model = LlamaModel(config)
  File "/baichuan/mule/venv/vllm_env/lib/python3.9/site-packages/vllm-0.1.2-py3.9-linux-x86_64.egg/vllm/model_executor/models/llama.py", line 200, in __init__
    self.layers = nn.ModuleList([
  File "/baichuan/mule/venv/vllm_env/lib/python3.9/site-packages/vllm-0.1.2-py3.9-linux-x86_64.egg/vllm/model_executor/models/llama.py", line 201, in <listcomp>
    LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)
  File "/baichuan/mule/venv/vllm_env/lib/python3.9/site-packages/vllm-0.1.2-py3.9-linux-x86_64.egg/vllm/model_executor/models/llama.py", line 146, in __init__
    self.self_attn = LlamaAttention(
  File "/baichuan/mule/venv/vllm_env/lib/python3.9/site-packages/vllm-0.1.2-py3.9-linux-x86_64.egg/vllm/model_executor/models/llama.py", line 103, in __init__
    self.qkv_proj = ColumnParallelLinear(
  File "/baichuan/mule/venv/vllm_env/lib/python3.9/site-packages/vllm-0.1.2-py3.9-linux-x86_64.egg/vllm/model_executor/parallel_utils/tensor_parallel/layers.py", line 272, in __init__
    self.weight = Parameter(torch.empty(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 79.35 GiB total capacity; 78.58 GiB already allocated; 145.19 MiB free; 78.58 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
nchabrindqa5nbrdjvbtg:9490:9646 [0] NCCL INFO [Service thread] Connection closed by localRank 0
nchabrindqa5nbrdjvbtg:9490:9490 [0] NCCL INFO comm 0x2ae7b450 rank 0 nranks 1 cudaDev 0 busId 65010 - Abort COMPLETE

But when I change the model from 70B to 13B, it works with no errors at all.
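
For context, a rough back-of-the-envelope estimate of the weight memory alone (bf16, 2 bytes per parameter, ignoring activations and the KV cache) suggests why 13B fits on a single 80 GiB card with the default tensor_parallel_size=1 while 70B cannot:

# Rough bf16 weight footprint; vLLM additionally reserves GPU memory for the KV cache.
BYTES_PER_PARAM = 2
GPU_MEM_GIB = 80

for params_billion in (13, 70):
    weights_gib = params_billion * 1e9 * BYTES_PER_PARAM / 2**30
    verdict = 'fits' if weights_gib < GPU_MEM_GIB else 'does not fit'
    print(f'{params_billion}B: ~{weights_gib:.0f} GiB of weights -> {verdict} on one {GPU_MEM_GIB} GiB GPU')

This prints roughly 24 GiB for 13B and 130 GiB for 70B, so the 70B weights alone exceed a single GPU; splitting them across GPUs with tensor parallelism divides that footprint by the number of GPUs.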

ywen666 commented 1 year ago

How about setting tensor_parallel_size=2?
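
For reference, a minimal sketch of that suggestion, reusing the checkpoint path and sampling settings from the report (the prompt string is just a placeholder). In vLLM the number of attention heads must be divisible by tensor_parallel_size, so 2, 4, or 8 are all valid choices on this 8-GPU node; larger values leave more headroom for the KV cache:

from vllm import LLM, SamplingParams

# Checkpoint path from the report; tensor_parallel_size=2 shards the ~130 GiB
# of bf16 weights across two 80 GiB cards instead of loading them on one GPU.
model_path = '/baichuan/mule/models/Llama-2-70b-chat-hf'
llm = LLM(model=model_path, dtype='bfloat16', tensor_parallel_size=2)

sampling_params = SamplingParams(n=1, temperature=0.3, top_p=0.85, top_k=5,
                                 max_tokens=2048, frequency_penalty=1.1)
outputs = llm.generate(['Hello, how are you?'], sampling_params)
print(outputs[0].outputs[0].text)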

hmellor commented 8 months ago

Closing this issue as stale as there has been no discussion in the past 3 months.

If you are still experiencing the issue you describe, feel free to re-open this issue.