vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: CUDA OOM error when loading another model after exiting the first one. #6682

Open R-C101 opened 1 month ago

R-C101 commented 1 month ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.28.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-1058-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.3.107
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla V100-SXM2-16GB
Nvidia driver version: 555.42.06
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.0.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             8
On-line CPU(s) list:                0-7
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
CPU family:                         6
Model:                              79
Thread(s) per core:                 2
Core(s) per socket:                 4
Socket(s):                          1
Stepping:                           1
CPU max MHz:                        3000.0000
CPU min MHz:                        1200.0000
BogoMIPS:                           4600.02
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
Hypervisor vendor:                  Xen
Virtualization type:                full
L1d cache:                          128 KiB (4 instances)
L1i cache:                          128 KiB (4 instances)
L2 cache:                           1 MiB (4 instances)
L3 cache:                           45 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-7
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                 Mitigation; PTE Inversion
Vulnerability Mds:                  Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown

Versions of relevant libraries:
[pip3] mypy==1.10.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] onnx==1.16.0
[pip3] onnxruntime==1.17.3
[pip3] optree==0.10.0
[pip3] pytorch-quantization==2.1.2
[pip3] torch==2.3.0
[pip3] torch-tensorrt==2.3.0a0
[pip3] torchaudio==2.3.0+cu118
[pip3] torchdata==0.7.1a0
[pip3] torchtext==0.17.0a0
[pip3] torchvision==0.18.0+cu118
[pip3] triton==2.3.0
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-7     0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

πŸ› Describe the bug

Unloading a model from memory doesn't work with the solutions provided. Please advise on how to use 2 models back to back. It is also worth noting that, according to (https://github.com/vllm-project/vllm/issues/1908#issuecomment-2101122008), subsequent models do get terminated; however, the first one still remains.

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel
import gc
import torch

def show_memory_usage():
    import torch.cuda
    import torch.distributed
    import gc

    print(f"cuda memory: {torch.cuda.memory_allocated()//1024//1024}MB")
    gc.collect()
    # torch.distributed.destroy_process_group()
    torch.cuda.empty_cache()
    print(f"  --> after gc: {torch.cuda.memory_allocated()//1024//1024}MB")
show_memory_usage()    
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
print('model 1')
llm = LLM(model="lmsys/vicuna-7b-v1.5", tensor_parallel_size=1, max_model_len=1024)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

# Using the workaround given in issue #1908
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker

del llm # Isn't necessary for releasing memory, but why not
show_memory_usage()
gc.collect()
torch.cuda.empty_cache()
import ray
ray.shutdown()
show_memory_usage()
print('model2')
llm = LLM(model="lmsys/vicuna-7b-v1.5", tensor_parallel_size=1, max_model_len=1024)
outputs2 = llm.generate(prompts, sampling_params)
# Print the outputs.
show_memory_usage()
for output in outputs2:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
show_memory_usage()
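
For completeness, the most aggressive in-process teardown I can piece together from related issues would look roughly like the sketch below. It is only a sketch: whether every call is available depends on the vLLM version (destroy_distributed_environment in particular may not exist in v0.4.1), and dropping the whole model_executor instead of just the driver worker is my assumption, not a documented API.

# Sketch of a fuller teardown between the two models (continuing the script above).
destroy_model_parallel()
# from vllm.distributed.parallel_state import destroy_distributed_environment  # newer vLLM versions only
# destroy_distributed_environment()
del llm.llm_engine.model_executor  # drop the whole executor, not just the driver worker
del llm
gc.collect()
torch.cuda.empty_cache()
ray.shutdown()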

Traceback:

cuda memory: 0MB
  --> after gc: 0MB
model 1
INFO 07-23 10:26:00 llm_engine.py:103] Initializing an LLM engine (v0.4.1) with config: model='lmsys/vicuna-7b-v1.5', speculative_config=None, tokenizer='lmsys/vicuna-7b-v1.5', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
=================Setting Up Params===================
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
INFO 07-23 10:26:02 utils.py:620] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:492: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:497: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:492: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:497: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
INFO 07-23 10:26:03 selector.py:65] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-23 10:26:03 selector.py:33] Using XFormers backend.
INFO 07-23 10:26:04 weight_utils.py:199] Using model weights format ['*.bin']
INFO 07-23 10:26:15 model_runner.py:172] Loading model weights took 12.5523 GB
INFO 07-23 10:26:16 gpu_executor.py:114] # GPU blocks: 114, # CPU blocks: 512
INFO 07-23 10:26:18 model_runner.py:871] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-23 10:26:18 model_runner.py:875] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-23 10:26:26 model_runner.py:952] Graph capturing finished in 7 secs.
Processed prompts: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [00:00<00:00,  9.84it/s]
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Prompt: 'Hello, my name is', Generated text: " Dima and I'm a graphic designer from Ukraine. As a kid"
Prompt: 'The president of the United States is', Generated text: ' a member of the executive branch of the federal government of the United States and serves'
Prompt: 'The capital of France is', Generated text: ' Paris.607. What is the population of Paris?\nThe population'
Prompt: 'The future of AI is', Generated text: ' likely to bring about many changes in the way we live and work. ΠΌΡƒΠ·Π΅'
cuda memory: 13819MB
  --> after gc: 13819MB
cuda memory: 13819MB
  --> after gc: 13819MB
model2
INFO 07-23 10:26:27 llm_engine.py:103] Initializing an LLM engine (v0.4.1) with config: model='lmsys/vicuna-7b-v1.5', speculative_config=None, tokenizer='lmsys/vicuna-7b-v1.5', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
=================Setting Up Params===================
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:492: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:497: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:492: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:497: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
Cell In[1], line 47
     45 show_memory_usage()
     46 print('model2')
---> 47 llm = LLM(model="lmsys/vicuna-7b-v1.5",tensor_parallel_size = 1,max_model_len = 1024)
     48 outputs2 = llm.generate(prompts, sampling_params)
     49 # Print the outputs.

File /workspace/vllm/vllm/entrypoints/llm.py:123, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, disable_custom_all_reduce, **kwargs)
    103     kwargs["disable_log_stats"] = True
    104 engine_args = EngineArgs(
    105     model=model,
    106     tokenizer=tokenizer,
   (...)
    121     **kwargs,
    122 )
--> 123 self.llm_engine = LLMEngine.from_engine_args(
    124     engine_args, usage_context=UsageContext.LLM_CLASS)
    125 self.request_counter = Counter()

File /workspace/vllm/vllm/engine/llm_engine.py:295, in LLMEngine.from_engine_args(cls, engine_args, usage_context)
    292     executor_class = GPUExecutor
    294 # Create the LLM engine.
--> 295 engine = cls(
    296     **engine_config.to_dict(),
    297     executor_class=executor_class,
    298     log_stats=not engine_args.disable_log_stats,
    299     usage_context=usage_context,
    300 )
    301 return engine

File /workspace/vllm/vllm/engine/llm_engine.py:162, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, vision_language_config, speculative_config, decoding_config, executor_class, log_stats, usage_context)
    158 self.seq_counter = Counter()
    159 self.generation_config_fields = _load_generation_config_dict(
    160     model_config)
--> 162 self.model_executor = executor_class(
    163     model_config=model_config,
    164     cache_config=cache_config,
    165     parallel_config=parallel_config,
    166     scheduler_config=scheduler_config,
    167     device_config=device_config,
    168     lora_config=lora_config,
    169     vision_language_config=vision_language_config,
    170     speculative_config=speculative_config,
    171     load_config=load_config,
    172 )
    174 if not self.model_config.embedding_mode:
    175     self._initialize_kv_caches()

File /workspace/vllm/vllm/executor/executor_base.py:41, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, vision_language_config, speculative_config)
     38 self.vision_language_config = vision_language_config
     39 self.speculative_config = speculative_config
---> 41 self._init_executor()

File /workspace/vllm/vllm/executor/gpu_executor.py:23, in GPUExecutor._init_executor(self)
     17 """Initialize the worker and load the model.
     18 
     19 If speculative decoding is enabled, we instead create the speculative
     20 worker.
     21 """
     22 if self.speculative_config is None:
---> 23     self._init_non_spec_worker()
     24 else:
     25     self._init_spec_worker()

File /workspace/vllm/vllm/executor/gpu_executor.py:69, in GPUExecutor._init_non_spec_worker(self)
     67 self.driver_worker = self._create_worker()
     68 self.driver_worker.init_device()
---> 69 self.driver_worker.load_model()

File /workspace/vllm/vllm/worker/worker.py:121, in Worker.load_model(self)
    120 def load_model(self):
--> 121     self.model_runner.load_model()

File /workspace/vllm/vllm/worker/model_runner.py:161, in ModelRunner.load_model(self)
    159 def load_model(self) -> None:
    160     with CudaMemoryProfiler() as m:
--> 161         self.model = get_model(
    162             model_config=self.model_config,
    163             device_config=self.device_config,
    164             load_config=self.load_config,
    165             lora_config=self.lora_config,
    166             vision_language_config=self.vision_language_config,
    167             parallel_config=self.parallel_config,
    168             scheduler_config=self.scheduler_config,
    169         )
    171     self.model_memory_usage = m.consumed_memory
    172     logger.info("Loading model weights took %.4f GB",
    173                 self.model_memory_usage / float(2**30))

File /workspace/vllm/vllm/model_executor/model_loader/__init__.py:19, in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, vision_language_config)
     13 def get_model(
     14         *, model_config: ModelConfig, load_config: LoadConfig,
     15         device_config: DeviceConfig, parallel_config: ParallelConfig,
     16         scheduler_config: SchedulerConfig, lora_config: Optional[LoRAConfig],
     17         vision_language_config: Optional[VisionLanguageConfig]) -> nn.Module:
     18     loader = get_model_loader(load_config)
---> 19     return loader.load_model(model_config=model_config,
     20                              device_config=device_config,
     21                              lora_config=lora_config,
     22                              vision_language_config=vision_language_config,
     23                              parallel_config=parallel_config,
     24                              scheduler_config=scheduler_config)

File /workspace/vllm/vllm/model_executor/model_loader/loader.py:221, in DefaultModelLoader.load_model(self, model_config, device_config, lora_config, vision_language_config, parallel_config, scheduler_config)
    219 with set_default_torch_dtype(model_config.dtype):
    220     with torch.device(device_config.device):
--> 221         model = _initialize_model(model_config, self.load_config,
    222                                   lora_config, vision_language_config)
    223     model.load_weights(
    224         self._get_weights_iterator(model_config.model,
    225                                    model_config.revision,
   (...)
    228                                        "fall_back_to_pt_during_load",
    229                                        True)), )
    230     for _, module in model.named_modules():

File /workspace/vllm/vllm/model_executor/model_loader/loader.py:87, in _initialize_model(model_config, load_config, lora_config, vision_language_config)
     84 model_class = get_model_architecture(model_config)[0]
     85 quant_config = _get_quantization_config(model_config, load_config)
---> 87 return model_class(config=model_config.hf_config,
     88                    quant_config=quant_config,
     89                    **_get_model_initialization_kwargs(
     90                        model_class, lora_config, vision_language_config))

File /workspace/vllm/vllm/model_executor/models/llama.py:376, in LlamaForCausalLM.__init__(self, config, quant_config, lora_config)
    374 super().__init__()
    375 self.config = config
--> 376 self.model = LlamaModel(config, quant_config, lora_config=lora_config)
    377 self.unpadded_vocab_size = config.vocab_size
    378 if lora_config:

File /workspace/vllm/vllm/model_executor/models/llama.py:267, in LlamaModel.__init__(self, config, quant_config, lora_config)
    261 self.org_vocab_size = config.vocab_size
    262 self.embed_tokens = VocabParallelEmbedding(
    263     self.vocab_size,
    264     config.hidden_size,
    265     org_num_embeddings=config.vocab_size,
    266 )
--> 267 self.layers = nn.ModuleList([
    268     LlamaDecoderLayer(config, quant_config)
    269     for _ in range(config.num_hidden_layers)
    270 ])
    271 self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)

File /workspace/vllm/vllm/model_executor/models/llama.py:268, in <listcomp>(.0)
    261 self.org_vocab_size = config.vocab_size
    262 self.embed_tokens = VocabParallelEmbedding(
    263     self.vocab_size,
    264     config.hidden_size,
    265     org_num_embeddings=config.vocab_size,
    266 )
    267 self.layers = nn.ModuleList([
--> 268     LlamaDecoderLayer(config, quant_config)
    269     for _ in range(config.num_hidden_layers)
    270 ])
    271 self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)

File /workspace/vllm/vllm/model_executor/models/llama.py:207, in LlamaDecoderLayer.__init__(self, config, quant_config)
    193 attention_bias = getattr(config, "attention_bias", False) or getattr(
    194     config, "bias", False)
    195 self.self_attn = LlamaAttention(
    196     hidden_size=self.hidden_size,
    197     num_heads=config.num_attention_heads,
   (...)
    205     sliding_window=sliding_window,
    206 )
--> 207 self.mlp = LlamaMLP(
    208     hidden_size=self.hidden_size,
    209     intermediate_size=config.intermediate_size,
    210     hidden_act=config.hidden_act,
    211     quant_config=quant_config,
    212 )
    213 self.input_layernorm = RMSNorm(config.hidden_size,
    214                                eps=config.rms_norm_eps)
    215 self.post_attention_layernorm = RMSNorm(config.hidden_size,
    216                                         eps=config.rms_norm_eps)

File /workspace/vllm/vllm/model_executor/models/llama.py:67, in LlamaMLP.__init__(self, hidden_size, intermediate_size, hidden_act, quant_config)
     62 super().__init__()
     63 self.gate_up_proj = MergedColumnParallelLinear(
     64     hidden_size, [intermediate_size] * 2,
     65     bias=False,
     66     quant_config=quant_config)
---> 67 self.down_proj = RowParallelLinear(intermediate_size,
     68                                    hidden_size,
     69                                    bias=False,
     70                                    quant_config=quant_config)
     71 if hidden_act != "silu":
     72     raise ValueError(f"Unsupported activation: {hidden_act}. "
     73                      "Only silu is supported for now.")

File /workspace/vllm/vllm/model_executor/layers/linear.py:633, in RowParallelLinear.__init__(self, input_size, output_size, bias, input_is_parallel, skip_bias_add, params_dtype, reduce_results, quant_config)
    631 # All the linear layer supports quant method.
    632 assert self.quant_method is not None
--> 633 self.quant_method.create_weights(self,
    634                                  self.input_size_per_partition,
    635                                  [self.output_size],
    636                                  self.input_size,
    637                                  self.output_size,
    638                                  self.params_dtype,
    639                                  weight_loader=self.weight_loader)
    641 if not reduce_results and (bias and not skip_bias_add):
    642     raise ValueError("When not reduce the results, adding bias to the "
    643                      "results can lead to incorrect results")

File /workspace/vllm/vllm/model_executor/layers/linear.py:81, in UnquantizedLinearMethod.create_weights(self, layer, input_size_per_partition, output_partition_sizes, input_size, output_size, params_dtype, **extra_weight_attrs)
     75 def create_weights(self, layer: torch.nn.Module,
     76                    input_size_per_partition: int,
     77                    output_partition_sizes: List[int], input_size: int,
     78                    output_size: int, params_dtype: torch.dtype,
     79                    **extra_weight_attrs):
     80     output_size_per_partition = sum(output_partition_sizes)
---> 81     weight = Parameter(torch.empty(output_size_per_partition,
     82                                    input_size_per_partition,
     83                                    dtype=params_dtype),
     84                        requires_grad=False)
     85     set_weight_attrs(weight, {"input_dim": 1, "output_dim": 0})
     86     layer.register_parameter("weight", weight)

File /usr/local/lib/python3.10/dist-packages/torch/utils/_device.py:78, in DeviceContext.__torch_function__(self, func, types, args, kwargs)
     76 if func in _device_constructors() and kwargs.get('device') is None:
     77     kwargs['device'] = self.device
---> 78 return func(*args, **kwargs)

OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 
DarkLight1337 commented 1 month ago

Can you try the method which I've suggested in #6544?

R-C101 commented 1 month ago

Can you try the method which I've suggested in #6544?

Hi, it runs on the example I gave you. Let me introduce this method into my workflow and see if it works there too. Thank you so much for your help.

Edit: I tested it on my workflow and it doesn't work the same way there. It works normally when using llm.generate; however, there is one part where I'm doing training using the following config:

        self.engine_args = EngineArgs(model=model_name, tensor_parallel_size=1, max_model_len=1024, dtype="float16")
        self.engine_config = self.engine_args.create_engine_config()
        self.engine_config.model_config.embedding_mode = True
        distributed_init_method = get_distributed_init_method(get_ip(), get_open_port())
        worker = Worker(
            model_config=self.engine_config.model_config,
            parallel_config=self.engine_config.parallel_config,
            scheduler_config=self.engine_config.scheduler_config,
            device_config=self.engine_config.device_config,
            cache_config=self.engine_config.cache_config,
            load_config=self.engine_config.load_config,
            local_rank=0,
            rank=0,
            distributed_init_method=distributed_init_method,
            is_driver_worker=True,
        )

        worker.init_device()
        worker.load_model()
        self.EMR = worker.model_runner

        # It's used like this:
        num_layers = self.engine_config.model_config.get_num_layers(self.engine_config.parallel_config)
        hs = self.EMR.execute_model(seqs, kv_caches=[None] * num_layers).to("cuda:3")
        hs = hs.reshape([self.num_generate, -1, input_embeds.shape[-1]])
        logits = output_embed_layer(hs.to(model.dtype))

Using the same method doesn't work in this case; what else would I have to quit/destroy here?
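
For reference, my current guess at the analogous teardown for this path is the sketch below; it is untested on my side and I am not sure it covers everything that keeps the weights alive.

# Hypothetical teardown for the direct-Worker path (untested sketch; assumes no other
# references to the worker or its model runner are still held):
destroy_model_parallel()
del self.EMR  # worker.model_runner keeps the loaded model alive
gc.collect()
torch.cuda.empty_cache()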

xansar commented 1 month ago

Can you try the method which I've suggested in #6544?

It does not work for me.😭

DarkLight1337 commented 1 month ago

Can you try the method which I've suggested in #6544?

It does not work for me.😭

Please show the code which you've used.