pseudotensor closed this issue 3 months ago
SGLang seems to go onto GPU 0 no matter what CUDA_VISIBLE_DEVICES (CVD) is set to. The 73GB process is an idefics2 model on TGI, but despite running export CUDA_VISIBLE_DEVICES="0,1"
before launching, SGLang ignores it and uses 0,1. How do I specify the GPUs?
|=======================================================================================|
| 0 N/A N/A 4006212 C /opt/conda/bin/python3.10 73028MiB |
| 0 N/A N/A 4051262 C python 8188MiB |
| 1 N/A N/A 4051266 C python 9952MiB |
+---------------------------------------------------------------------------------------+
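To see how full each GPU is, the process rows above can be summed per device. The following is a minimal sketch that parses only the three rows quoted in this issue; the regular expression is an assumption about this particular row layout, not a general nvidia-smi parser.

```python
import re
from collections import defaultdict

# Process rows copied from the nvidia-smi output quoted in this issue.
ROWS = """\
| 0 N/A N/A 4006212 C /opt/conda/bin/python3.10 73028MiB |
| 0 N/A N/A 4051262 C python 8188MiB |
| 1 N/A N/A 4051266 C python 9952MiB |
"""

# Columns: GPU index, GI id, CI id, PID, process type, process name, memory.
ROW_RE = re.compile(
    r"^\|\s*(\d+)\s+\S+\s+\S+\s+(\d+)\s+(\S+)\s+(.+?)\s+(\d+)MiB\s*\|$"
)


def memory_per_gpu(text):
    """Sum the reported process memory (MiB) for each GPU index."""
    totals = defaultdict(int)
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            totals[int(match.group(1))] += int(match.group(5))
    return dict(totals)


print(memory_per_gpu(ROWS))  # {0: 81216, 1: 9952}
```

GPU 0 already carries 81216 MiB (about 79.3 GiB) across two processes. If these are 80 GB cards, which is an assumption since the thread only mentions ">80GB", that leaves almost nothing for another model on GPU 0.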
dang, typo on my end.
Installed via pip in Python 3.10 as the README says, then ran:
export CUDA_VISIBLE_DEVICES=1
python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --tokenizer-path lmms-lab/llama3-llava-next-8b-tokenizer --port=30000 --host="0.0.0.0" --tp-size=1 --api-key='62224bfb-c832-4452-81e7-8a4bdabbe164' --random-seed=1234 --context-length=8192
Nothing is on GPU=1; only GPU=0 is filled.
It always hits this error very early on startup, before the model is even loaded:
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 39, in __init__
    self.gate_up_proj = MergedColumnParallelLinear(
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 333, in __init__
    super().__init__(input_size, sum(output_sizes), bias, gather_output,
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 236, in __init__
    self.quant_method.create_weights(self,
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 81, in create_weights
    weight = Parameter(torch.empty(output_size_per_partition,
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_device.py", line 78, in __torch_function__
    return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU
Initialization failed. detoken_init_state: init ok
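I can't believe this model needs >80GB.
Using CVD "1,2" and --tp-size=2, it starts and downloads the model, but seems to get stuck and never finishes loading the weights for rank 0.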
/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:100: FutureWarning: The `vocab_size` argument is deprecated and will be removed in v4.42, since it can be inferred from the `text_config`. Passing this argument has no effect
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
server started on [0.0.0.0]:10004
server started on [0.0.0.0]:10005
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
accepted ('127.0.0.1', 41688) with fd 36
welcome ('127.0.0.1', 41688)
accepted ('127.0.0.1', 48486) with fd 32
welcome ('127.0.0.1', 48486)
/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:140: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:140: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
NCCL version 2.20.5+cuda12.4
Rank 1: load weight begin.
Rank 0: load weight begin.
config.json: 100%|██████████| 4.76k/4.76k [00:00<00:00, 41.2MB/s]
pytorch_model.bin: 100%|██████████| 1.71G/1.71G [00:05<00:00, 319MB/s]
Using model weights format ['*.safetensors']
model-00001-of-00004.safetensors: 100%|██████████| 4.98G/4.98G [00:07<00:00, 636MB/s]
model-00002-of-00004.safetensors: 100%|██████████| 5.00G/5.00G [00:06<00:00, 798MB/s]
model-00003-of-00004.safetensors: 100%|██████████| 4.92G/4.92G [00:06<00:00, 750MB/s]
model-00004-of-00004.safetensors: 100%|██████████| 1.82G/1.82G [00:04<00:00, 383MB/s]
Using model weights format ['*.safetensors']
Rank 1: load weight end.
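The file sizes in the download log above give a rough check on the ">80GB" impression. This sketch only adds up the sizes printed by the progress bars. It says nothing about KV cache or activation memory, which come on top at runtime.

```python
# Sizes in GB as printed by the download progress bars in the log above.
safetensors_shards_gb = [4.98, 5.00, 4.92, 1.82]
pytorch_model_bin_gb = 1.71

weights_gb = sum(safetensors_shards_gb)
all_files_gb = weights_gb + pytorch_model_bin_gb

print(f"safetensors shards: {weights_gb:.2f} GB")             # 16.72 GB
print(f"all downloaded weight files: {all_files_gb:.2f} GB")  # 18.43 GB
```

The weights themselves are far below 80 GB, so an out-of-memory error this early in startup is more plausibly explained by the target GPU being occupied already than by the model's size.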
May I ask where the problem occurred and how it was resolved? I have the same problem you initially had.
I had misspelled "CUDA_VISIBLE_DEVICES", so the setting had no effect and SGLang was still running on a GPU whose memory was already consumed.
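A misspelled variable name fails silently, because the shell exports it without complaint and CUDA never looks at it. A minimal sketch of a pre-launch check follows; the helper name `find_misspelled` and the 0.8 similarity cutoff are illustrative choices, not part of SGLang.

```python
import difflib
import os

EXPECTED = "CUDA_VISIBLE_DEVICES"


def find_misspelled(environ, expected=EXPECTED, cutoff=0.8):
    """Return environment variable names that look like typos of `expected`."""
    candidates = [name for name in environ if name != expected]
    return difflib.get_close_matches(expected, candidates, n=5, cutoff=cutoff)


# Check the current environment before launching the server.
typos = find_misspelled(os.environ)
if typos:
    print(f"Possible misspelling of {EXPECTED}: {typos}")
else:
    print(f"{EXPECTED} = {os.environ.get(EXPECTED, '<unset>')}")
```

Running this in the same shell, right before `python -m sglang.launch_server`, would flag a near-miss such as `CUDA_VISIBLE_DEVICE`. The comparison is case-sensitive, so an all-lowercase variant would not be caught.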