sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sglang.readthedocs.io/en/latest/
Apache License 2.0
5.17k stars, 365 forks

CUDA out of memory for H100 80GB for lmms-lab/llama3-llava-next-8b #465

Closed by pseudotensor 3 months ago

pseudotensor commented 3 months ago

Installed via pip on Python 3.10 as the README says, then ran:

export CUDA_VISIBLE_DEVICES=1
python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --tokenizer-path lmms-lab/llama3-llava-next-8b-tokenizer --port=30000 --host="0.0.0.0" --tp-size=1 --api-key='62224bfb-c832-4452-81e7-8a4bdabbe164'  --random-seed=1234 --context-length=8192

Nothing is on GPU 1; only GPU 0 is filled.

The error always hits very early during startup, before the model is even loaded:

  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 39, in __init__
    self.gate_up_proj = MergedColumnParallelLinear(
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 333, in __init__
    super().__init__(input_size, sum(output_sizes), bias, gather_output,
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 236, in __init__
    self.quant_method.create_weights(self,
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 81, in create_weights
    weight = Parameter(torch.empty(output_size_per_partition,
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_device.py", line 78, in __torch_function__
    return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 

Initialization failed. detoken_init_state: init ok

I can't believe >80GB is needed for this model.

Using CUDA_VISIBLE_DEVICES (CVD) "1,2" and --tp-size=2, the server starts and downloads the model, but it seems to get stuck and never finishes loading the weights for rank 0.

/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:100: FutureWarning: The `vocab_size` argument is deprecated and will be removed in v4.42, since it can be inferred from the `text_config`. Passing this argument has no effect
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
server started on [0.0.0.0]:10004
server started on [0.0.0.0]:10005
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
accepted ('127.0.0.1', 41688) with fd 36
welcome ('127.0.0.1', 41688)
accepted ('127.0.0.1', 48486) with fd 32
welcome ('127.0.0.1', 48486)
/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:140: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:140: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
NCCL version 2.20.5+cuda12.4
Rank 1: load weight begin.
Rank 0: load weight begin.
config.json: 100%|██████████| 4.76k/4.76k [00:00<00:00, 41.2MB/s]
pytorch_model.bin: 100%|██████████| 1.71G/1.71G [00:05<00:00, 319MB/s]
Using model weights format ['*.safetensors']
model-00001-of-00004.safetensors: 100%|██████████| 4.98G/4.98G [00:07<00:00, 636MB/s]
model-00002-of-00004.safetensors: 100%|██████████| 5.00G/5.00G [00:06<00:00, 798MB/s]
model-00003-of-00004.safetensors: 100%|██████████| 4.92G/4.92G [00:06<00:00, 750MB/s]
model-00004-of-00004.safetensors: 100%|██████████| 1.82G/1.82G [00:04<00:00, 383MB/s]
Using model weights format ['*.safetensors']
Rank 1: load weight end.
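
As a rough check on the ">80GB" doubt, the file sizes printed in the download log above can simply be added up. This is a minimal sketch: the numbers are copied from the progress lines, the function and constant names are made up for illustration, and it counts only the downloaded files, not whatever extra memory the server reserves at runtime.

```python
# Sizes in GB, copied from the download progress lines in the log above.
SAFETENSORS_SHARDS_GB = [4.98, 5.00, 4.92, 1.82]
PYTORCH_MODEL_BIN_GB = 1.71  # the separately downloaded pytorch_model.bin


def total_download_gb(shards, extra=0.0):
    """Sum the downloaded file sizes, rounded to two decimals."""
    return round(sum(shards) + extra, 2)


if __name__ == "__main__":
    shards_only = total_download_gb(SAFETENSORS_SHARDS_GB)
    everything = total_download_gb(SAFETENSORS_SHARDS_GB, PYTORCH_MODEL_BIN_GB)
    print(f"safetensors shards: {shards_only} GB")
    print(f"all downloaded weight files: {everything} GB")
```

The four shards come to 16.72 GB, or 18.43 GB including pytorch_model.bin, which is far below 80 GB. That suggests the out-of-memory error is about how much of the GPU was already occupied rather than about the size of this model.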
pseudotensor commented 3 months ago

SGLang seems to go onto GPU 0 no matter what the CVD setting is. The 73GB process is an idefics2 model on TGI, but despite running export CUDA_VISIBLE_DEVICES="1,2" before launching, SGLang ignores it and uses GPUs 0 and 1. How do I specify the GPUs?

|=======================================================================================|
|    0   N/A  N/A   4006212      C   /opt/conda/bin/python3.10                 73028MiB |
|    0   N/A  N/A   4051262      C   python                                     8188MiB |
|    1   N/A  N/A   4051266      C   python                                     9952MiB |
+---------------------------------------------------------------------------------------+
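
To make the table above easier to read at a glance, the process rows can be summed per GPU. The sketch below assumes the usual nvidia-smi process-table column order (GPU, GI, CI, PID, type, process name, memory) and process names without spaces; the regex and the function name are illustrative, not part of any tool.

```python
import re
from collections import defaultdict

# Process rows copied from the nvidia-smi output above.
NVIDIA_SMI_ROWS = """\
|    0   N/A  N/A   4006212      C   /opt/conda/bin/python3.10                 73028MiB |
|    0   N/A  N/A   4051262      C   python                                     8188MiB |
|    1   N/A  N/A   4051266      C   python                                     9952MiB |
"""

# Columns: GPU index, GI, CI, PID, type, process name, memory in MiB.
# Assumption: the process name contains no whitespace.
ROW_RE = re.compile(
    r"^\|\s*(?P<gpu>\d+)\s+\S+\s+\S+\s+(?P<pid>\d+)\s+\S+\s+"
    r"(?P<name>\S+)\s+(?P<mib>\d+)MiB\s*\|$"
)


def memory_per_gpu(table_text):
    """Return {gpu_index: total MiB used by the listed processes}."""
    totals = defaultdict(int)
    for line in table_text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            totals[int(match.group("gpu"))] += int(match.group("mib"))
    return dict(totals)


if __name__ == "__main__":
    for gpu, mib in sorted(memory_per_gpu(NVIDIA_SMI_ROWS).items()):
        print(f"GPU {gpu}: {mib} MiB in use")
```

For these rows the totals are 81216 MiB on GPU 0 (73028 MiB of it from the TGI process) and 9952 MiB on GPU 1, so GPU 0 is essentially full while GPU 1 has plenty of room.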
pseudotensor commented 3 months ago

dang, typo on my end.

Bill-WangJiLong commented 3 months ago

May I ask where the problem occurred and how it was resolved? I have the same problem you initially had.

pseudotensor commented 3 months ago

I had misspelled "CUDA_VISIBLE_DEVICES", so the setting had no effect and the server was still running on a GPU whose memory was already consumed.
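
A misspelled variable like this fails silently, so a small pre-launch check can help. The following is only a sketch under assumptions: the thread does not say what the actual misspelling was, the helper names are hypothetical, and the 0.8 similarity cutoff is an arbitrary choice. It uses only the standard library to look for environment variable names that closely resemble `CUDA_VISIBLE_DEVICES` without matching it exactly.

```python
import difflib
import os

EXPECTED = "CUDA_VISIBLE_DEVICES"


def find_near_misses(environ, expected=EXPECTED, cutoff=0.8):
    """Return env var names that resemble `expected` but are not equal to it."""
    candidates = [name for name in environ if name != expected]
    return difflib.get_close_matches(expected, candidates, n=5, cutoff=cutoff)


def check_gpu_env(environ, expected=EXPECTED):
    """Return human-readable warnings about the GPU selection variable."""
    warnings = []
    if expected not in environ:
        warnings.append(f"{expected} is not set; all GPUs will be visible.")
    for name in find_near_misses(environ, expected):
        warnings.append(f"Found {name!r}; did you mean {expected!r}?")
    return warnings


if __name__ == "__main__":
    # Run this in the same shell, right before launching the server.
    for message in check_gpu_env(os.environ):
        print("WARNING:", message)
```

With a hypothetical typo such as `CUDA_VISIBLE_DEVICE=1` in the environment, the check reports both that the real variable is unset and that a near-miss name was found.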