Open bingwork opened 1 month ago
I am not sure about your GPUs. Just offering a data point: I could run it with 4×80G A100s or 8×40G A100s.
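A minimal sketch of launching with that multi-GPU setup, assuming sglang's `Runtime` accepts a `tp_size` keyword for tensor parallelism (kwarg names may differ across sglang versions, so treat this as a launch sketch rather than a verified config):

```python
# Sketch: shard the 72B checkpoint across 4x A100-80G via tensor
# parallelism instead of loading everything onto a single GPU.
# `tp_size` is assumed to be the sglang Runtime kwarg controlling this.
import sglang as sgl

runtime = sgl.Runtime(
    model_path="lmms-lab/llava-next-72b",
    tp_size=4,  # split the weights over 4 GPUs
)
```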
Thanks for your reply. I use NVIDIA A100-SXM4-80GB.
When I run sglang/examples/usage/llava/srt_llava_next_test.py with "lmms-lab/llava-next-72b" instead of "lmms-lab/llama3-llava-next-8b", it reports OOM as below. Could anyone take some time to give suggestions? Thank you very much!
```
Initialization failed. router_init_state: Traceback (most recent call last):
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/managers/router/manager.py", line 71, in start_router_process
    model_client = ModelRpcClient(server_args, port_args, model_overide_args)
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/managers/router/model_rpc.py", line 739, in __init__
    self.model_server = ModelRpcService().exposed_ModelRpcServer(
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/managers/router/model_rpc.py", line 73, in __init__
    self.model_runner = ModelRunner(
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/managers/router/model_runner.py", line 256, in __init__
    self.load_model()
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/managers/router/model_runner.py", line 279, in load_model
    self.model = get_model(
  File "/home/ubuntu/anaconda3/envs/llava_py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/home/ubuntu/anaconda3/envs/llava_py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 222, in load_model
    model = _initialize_model(model_config, self.load_config,
  File "/home/ubuntu/anaconda3/envs/llava_py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 88, in _initialize_model
    return model_class(config=model_config.hf_config,
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/models/llava.py", line 298, in __init__
    super().__init__(config, quant_config=quant_config)
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/models/llava.py", line 37, in __init__
    self.language_model = LlamaForCausalLM(config, quant_config=quant_config)
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/models/llama2.py", line 261, in __init__
    self.model = LlamaModel(config, quant_config=quant_config)
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/models/llama2.py", line 221, in __init__
    [
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/models/llama2.py", line 222, in <listcomp>
    LlamaDecoderLayer(config, i, quant_config=quant_config)
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/models/llama2.py", line 170, in __init__
    self.mlp = LlamaMLP(
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/models/llama2.py", line 45, in __init__
    self.down_proj = RowParallelLinear(
  File "/home/ubuntu/anaconda3/envs/llava_py310/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 633, in __init__
    self.quant_method.create_weights(self,
  File "/home/ubuntu/anaconda3/envs/llava_py310/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 81, in create_weights
    weight = Parameter(torch.empty(output_size_per_partition,
  File "/home/ubuntu/anaconda3/envs/llava_py310/lib/python3.10/site-packages/torch/utils/_device.py", line 78, in __torch_function__
    return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB. GPU
```
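The OOM is expected from the raw numbers alone: a back-of-the-envelope check (assuming fp16/bf16 weights, i.e. 2 bytes per parameter) shows the 72B weights cannot fit on a single 80 GiB card even before counting the vision tower, activations, or KV cache:

```python
# Back-of-the-envelope weight-memory check for a 72B-parameter model
# loaded in fp16/bf16 (2 bytes per parameter). This ignores the vision
# tower, activations, and KV cache, so real usage is higher still.
params = 72e9
bytes_per_param = 2  # fp16 / bf16
weights_gib = params * bytes_per_param / 1024**3

print(f"weights alone: {weights_gib:.0f} GiB")         # ~134 GiB
print(f"fits on one 80 GiB A100? {weights_gib < 80}")  # False
print(f"per-GPU share at tp_size=4: {weights_gib / 4:.1f} GiB")
```

This is why the reply above suggests 4×80G or 8×40G: only after sharding the weights across several GPUs does each card's share drop well under its capacity, leaving room for the KV cache.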
```
Initialization failed. detoken_init_state: init ok
Traceback (most recent call last):
  File "/home/ubuntu/wubing/sglang/examples/usage/llava/srt_llava_next_test.py", line 64, in <module>
    runtime = sgl.Runtime(
  File "/home/ubuntu/wubing/sglang/python/sglang/api.py", line 39, in Runtime
    return Runtime(*args, **kwargs)
  File "/home/ubuntu/wubing/sglang/python/sglang/srt/server.py", line 291, in __init__
    raise RuntimeError(
RuntimeError: Initialization failed. Please see the error messages above.
```