sgl-project / sglang

SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
Apache License 2.0

lmms-lab/llava-next-72b CUDA out of memory #483

Open bingwork opened 1 month ago

bingwork commented 1 month ago

When I run sglang/examples/usage/llava/srt_llava_next_test.py with the model changed from "lmms-lab/llama3-llava-next-8b" to "lmms-lab/llava-next-72b", it reports an OOM error as below. Could anyone take the time to give some suggestions? Thank you very much!

```
Initialization failed. router_init_state: Traceback (most recent call last):
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/managers/router/manager.py", line 71, in start_router_process
    model_client = ModelRpcClient(server_args, port_args, model_overide_args)
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/managers/router/model_rpc.py", line 739, in __init__
    self.model_server = ModelRpcService().exposed_ModelRpcServer(
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/managers/router/model_rpc.py", line 73, in __init__
    self.model_runner = ModelRunner(
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/managers/router/model_runner.py", line 256, in __init__
    self.load_model()
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/managers/router/model_runner.py", line 279, in load_model
    self.model = get_model(
  File "/home/ubuntu/anaconda3/envs/llava_py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/home/ubuntu/anaconda3/envs/llava_py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 222, in load_model
    model = _initialize_model(model_config, self.load_config,
  File "/home/ubuntu/anaconda3/envs/llava_py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 88, in _initialize_model
    return model_class(config=model_config.hf_config,
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/models/llava.py", line 298, in __init__
    super().__init__(config, quant_config=quant_config)
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/models/llava.py", line 37, in __init__
    self.language_model = LlamaForCausalLM(config, quant_config=quant_config)
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/models/llama2.py", line 261, in __init__
    self.model = LlamaModel(config, quant_config=quant_config)
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/models/llama2.py", line 221, in __init__
    [
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/models/llama2.py", line 222, in <listcomp>
    LlamaDecoderLayer(config, i, quant_config=quant_config)
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/models/llama2.py", line 170, in __init__
    self.mlp = LlamaMLP(
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/models/llama2.py", line 45, in __init__
    self.down_proj = RowParallelLinear(
  File "/home/ubuntu/anaconda3/envs/llava_py310/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 633, in __init__
    self.quant_method.create_weights(self,
  File "/home/ubuntu/anaconda3/envs/llava_py310/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 81, in create_weights
    weight = Parameter(torch.empty(output_size_per_partition,
  File "/home/ubuntu/anaconda3/envs/llava_py310/lib/python3.10/site-packages/torch/utils/_device.py", line 78, in __torch_function__
    return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB. GPU
```

```
Initialization failed. detoken_init_state: init ok
Traceback (most recent call last):
  File "/home/ubuntu/wubing/sglang/examples/usage/llava/srt_llava_next_test.py", line 64, in <module>
    runtime = sgl.Runtime(
  File "/home/ubuntu/wubing/sglang/python/sglang/api.py", line 39, in Runtime
    return Runtime(*args, **kwargs)
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/server.py", line 291, in __init__
    raise RuntimeError(
RuntimeError: Initialization failed. Please see the error messages above.
```

Luodian commented 1 month ago

I am not sure about your GPUs. Just providing a data point: I could run it with 4×80GB A100s or 8×40GB A100s.
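For context on why a single card cannot hold this model, here is a rough back-of-envelope estimate (my own arithmetic, not from the thread; it assumes fp16/bf16 weights at 2 bytes per parameter and ignores the vision tower, KV cache, and activations, which all need memory on top):

```python
# Rough weight-memory estimate for a 72B-parameter model.
# Assumption: fp16/bf16 weights, 2 bytes per parameter; the KV cache,
# activations, and the vision encoder would require additional memory.
params = 72e9
bytes_per_param = 2

total_gb = params * bytes_per_param / 1e9
print(f"weights alone: {total_gb:.0f} GB")  # 144 GB, far beyond one 80GB A100

# With tensor parallelism, the weights are sharded across GPUs:
for tp in (4, 8):
    print(f"tp={tp}: ~{total_gb / tp:.0f} GB of weights per GPU")
```

This matches the data point above: 144 GB of weights shard to ~36 GB per GPU on 4×80GB, or ~18 GB per GPU on 8×40GB, leaving headroom for the KV cache.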

bingwork commented 1 month ago

> I am not sure about your GPUs. Just providing a data point: I could run it with 4×80GB A100 or 8×40GB A100.

Thanks for your reply. I use an NVIDIA A100-SXM4-80GB.
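Since a single 80GB card is not enough, a multi-GPU launch with tensor parallelism would be needed. A sketch, assuming a 4×80GB node; the exact flag names (`--tp-size` here) may differ between sglang versions, so check `python -m sglang.launch_server --help` first:

```shell
# Hypothetical launch: shard lmms-lab/llava-next-72b across 4 GPUs
# via tensor parallelism. Verify flag names against your sglang version.
python -m sglang.launch_server \
  --model-path lmms-lab/llava-next-72b \
  --tp-size 4 \
  --port 30000
```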