sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

no longer can load 72b llava qwen on 4*H100 80GB #485

Closed · pseudotensor closed this issue 3 months ago

pseudotensor commented 5 months ago

After updating from my March 24 checkout of main to the latest main, I can no longer run the 72B model without hitting some kind of OOM.

pip uninstall sglang
git pull
cd python
pip install -e ".[all]"

then

export CUDA_VISIBLE_DEVICES="3,4,5,6"
python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --tokenizer-path lmms-lab/llavanext-qwen-tokenizer --port=30010 --host="0.0.0.0" --tp-size=4 --random-seed=1234 --context-length=32768  &> 72balone.log &

This now always leads to the errors below. I also tried --mem-fraction-static=0.9 and --mem-fraction-static=0.99; the latter gets further but still fails later. Before, I didn't set this option at all and it worked.
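For context, here is my rough mental model of the flag (a back-of-the-envelope approximation on my part, not SGLang's actual init_memory_pool code): --mem-fraction-static caps the memory reserved for static allocations (weights plus the KV cache pool), so the KV cache budget is roughly total memory times the fraction, minus whatever the weights already consume. Plugging in the numbers from the logs below:

# Rough sketch only, NOT SGLang's real code.
def kv_cache_budget_gb(total_gpu_gb, mem_fraction_static, used_by_weights_gb):
    # Memory left for the KV cache pool under the static-memory cap.
    return total_gpu_gb * mem_fraction_static - used_by_weights_gb

total_gb = 78.81            # "Init torch begin. Avail mem=78.81 GB"
weights_gb = 78.81 - 5.64   # ~73 GB consumed by weights on rank 0 ("Avail mem=5.64 GB")

print(kv_cache_budget_gb(total_gb, 0.90, weights_gb))  # negative -> "Not enough memory" error
print(kv_cache_budget_gb(total_gb, 0.99, weights_gb))  # ~5 GB -> tiny max_total_num_tokens

That lines up with what I see: 0.9 fails immediately in init_memory_pool, while 0.99 gets only max_total_num_tokens=6574 and then hits the NCCL "out of memory" on the first forward pass. The surprising part is that the weights alone leave only ~5-6 GB free per GPU at --tp-size=4.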

failure with --mem-fraction-static=0.9:

~
/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:100: FutureWarning: The `vocab_size` argument is deprecated and will be removed in v4.42, since it can be inferred from the `text_config`. Passing this argument has no effect
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
server started on [0.0.0.0]:30016
server started on [0.0.0.0]:30015
server started on [0.0.0.0]:30017
server started on [0.0.0.0]:30018
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
accepted ('127.0.0.1', 41644) with fd 33
accepted ('127.0.0.1', 34772) with fd 44
welcome ('127.0.0.1', 34772)
welcome ('127.0.0.1', 41644)
accepted ('127.0.0.1', 34388) with fd 33
accepted ('127.0.0.1', 41534) with fd 33
welcome ('127.0.0.1', 41534)
welcome ('127.0.0.1', 34388)
/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:140: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:140: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:140: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:140: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
[rank=3] Init torch begin. Avail mem=78.81 GB
[rank=0] Init torch begin. Avail mem=78.81 GB
[rank=2] Init torch begin. Avail mem=78.81 GB
[rank=1] Init torch begin. Avail mem=78.81 GB
[rank=0] Init torch end.
[rank=3] Init torch end.
[rank=2] Init torch end.
[rank=1] Init torch end.
NCCL version 2.20.5+cuda12.4
[rank=1] Load weight begin.
[rank=0] Load weight begin.
[rank=2] Load weight begin.
[rank=3] Load weight begin.
INFO 05-27 22:18:33 weight_utils.py:199] Using model weights format ['*.safetensors']
INFO 05-27 22:18:33 weight_utils.py:199] Using model weights format ['*.safetensors']
INFO 05-27 22:18:33 weight_utils.py:199] Using model weights format ['*.safetensors']
INFO 05-27 22:18:33 weight_utils.py:199] Using model weights format ['*.safetensors']
[rank=2] Load weight end. Type=LlavaQwenForCausalLM. Avail mem=5.59 GB
[rank=3] Load weight end. Type=LlavaQwenForCausalLM. Avail mem=6.10 GB
[rank=0] Load weight end. Type=LlavaQwenForCausalLM. Avail mem=5.64 GB
[rank=1] Load weight end. Type=LlavaQwenForCausalLM. Avail mem=5.59 GB
Initialization failed. router_init_state: Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/manager.py", line 71, in start_router_process
    model_client = ModelRpcClient(server_args, port_args, model_overide_args)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 777, in __init__
    self.step = async_wrap("step")
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 768, in async_wrap
    fs = [rpyc.async_(getattr(m, func_name)) for m in self.model_servers]
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 768, in <listcomp>
    fs = [rpyc.async_(getattr(m, func_name)) for m in self.model_servers]
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 760, in init_model
    return self.remote_services[i].ModelRpcServer(
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/rpyc/core/netref.py", line 239, in __call__
    return syncreq(_self, consts.HANDLE_CALL, args, kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/rpyc/core/netref.py", line 63, in syncreq
    return conn.sync_request(handler, proxy, *args)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/rpyc/core/protocol.py", line 744, in sync_request
    return _async_res.value
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/rpyc/core/async_.py", line 111, in value
    raise self._obj
_get_exception_class.<locals>.Derived: Not enought memory. Please try to increase --mem-fraction-static.

========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/rpyc/core/protocol.py", line 369, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/rpyc/core/protocol.py", line 863, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 73, in __init__
    self.model_runner = ModelRunner(
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 257, in __init__
    self.init_memory_pool(total_gpu_memory)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 307, in init_memory_pool
    raise RuntimeError(
RuntimeError: Not enought memory. Please try to increase --mem-fraction-static.

Initialization failed. detoken_init_state: init ok
goodbye ('127.0.0.1', 41644)
goodbye ('127.0.0.1', 41534)
goodbye ('127.0.0.1', 34772)
goodbye ('127.0.0.1', 34388)

failure with --mem-fraction-static=0.98 or 0.99:

/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:100: FutureWarning: The `vocab_size` argument is deprecated and will be removed in v4.42, since it can be inferred from the `text_config`. Passing this argument has no effect
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
server started on [0.0.0.0]:30016
server started on [0.0.0.0]:30015
server started on [0.0.0.0]:30017
server started on [0.0.0.0]:30018
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
accepted ('127.0.0.1', 48504) with fd 33
accepted ('127.0.0.1', 45472) with fd 44
welcome ('127.0.0.1', 45472)
welcome ('127.0.0.1', 48504)
accepted ('127.0.0.1', 34310) with fd 33
welcome ('127.0.0.1', 34310)
accepted ('127.0.0.1', 35260) with fd 33
welcome ('127.0.0.1', 35260)
/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:140: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:140: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:140: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:140: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
[rank=0] Init torch begin. Avail mem=78.81 GB
[rank=2] Init torch begin. Avail mem=78.81 GB
[rank=3] Init torch begin. Avail mem=78.81 GB
[rank=1] Init torch begin. Avail mem=78.81 GB
[rank=1] Init torch end.
[rank=2] Init torch end.
[rank=0] Init torch end.
[rank=3] Init torch end.
NCCL version 2.20.5+cuda12.4
[rank=0] Load weight begin.
[rank=2] Load weight begin.
[rank=1] Load weight begin.
[rank=3] Load weight begin.
INFO 05-27 22:27:47 weight_utils.py:199] Using model weights format ['*.safetensors']
INFO 05-27 22:27:47 weight_utils.py:199] Using model weights format ['*.safetensors']
INFO 05-27 22:27:47 weight_utils.py:199] Using model weights format ['*.safetensors']
INFO 05-27 22:27:47 weight_utils.py:199] Using model weights format ['*.safetensors']
[rank=1] Load weight end. Type=LlavaQwenForCausalLM. Avail mem=5.59 GB
[rank=0] Load weight end. Type=LlavaQwenForCausalLM. Avail mem=5.64 GB
[rank=2] Load weight end. Type=LlavaQwenForCausalLM. Avail mem=5.59 GB
[rank=3] Load weight end. Type=LlavaQwenForCausalLM. Avail mem=6.10 GB
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[rank=1] max_total_num_tokens=6574, max_prefill_tokens=32768, context_len=32768, 
[rank=2] max_total_num_tokens=6574, max_prefill_tokens=32768, context_len=32768, 
[rank=3] max_total_num_tokens=6574, max_prefill_tokens=32768, context_len=32768, 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[rank=0] max_total_num_tokens=6574, max_prefill_tokens=32768, context_len=32768, 
server_args: enable_flashinfer=False, attention_reduce_in_fp32=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_disk_cache=False, 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [409947]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:30010 (Press CTRL+C to quit)
INFO:     127.0.0.1:51630 - "GET /get_model_info HTTP/1.1" 200 OK
new fill batch. #seq: 1. #cached_token: 0. #new_token: 6. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 0.00%.

compute-permanent-node-406:410040:411021 [1] transport/nvls.cc:147 NCCL WARN Cuda failure 2 'out of memory'

compute-permanent-node-406:410044:411018 [0] transport/nvls.cc:147 NCCL WARN Cuda failure 2 'out of memory'
Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 189, in exposed_step
    self.forward_step()
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 204, in forward_step
    self.forward_fill_batch(new_batch)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 441, in forward_fill_batch
    ) = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 404, in forward
    return self.forward_extend_multi_modal(batch)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 393, in forward_extend_multi_modal
    return self.model.forward(
  File "/home/ubuntu/sglang/python/sglang/srt/models/llava.py", line 103, in forward
    input_embeds = self.language_model.model.embed_tokens(input_ids)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 105, in forward
    output = tensor_model_parallel_all_reduce(output_parallel)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/vllm/distributed/communication_op.py", line 39, in tensor_model_parallel_all_reduce
    torch.distributed.all_reduce(input_,
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 2 'out of memory'

Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 189, in exposed_step
    self.forward_step()
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 204, in forward_step
    self.forward_fill_batch(new_batch)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 441, in forward_fill_batch
    ) = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 404, in forward
    return self.forward_extend_multi_modal(batch)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 393, in forward_extend_multi_modal
    return self.model.forward(
  File "/home/ubuntu/sglang/python/sglang/srt/models/llava.py", line 103, in forward
    input_embeds = self.language_model.model.embed_tokens(input_ids)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 105, in forward
    output = tensor_model_parallel_all_reduce(output_parallel)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/vllm/distributed/communication_op.py", line 39, in tensor_model_parallel_all_reduce
    torch.distributed.all_reduce(input_,
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 2 'out of memory'

compute-permanent-node-406:410045:411020 [2] transport/nvls.cc:147 NCCL WARN Cuda failure 2 'out of memory'

compute-permanent-node-406:410046:411019 [3] transport/nvls.cc:147 NCCL WARN Cuda failure 2 'out of memory'
Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 189, in exposed_step
    self.forward_step()
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 204, in forward_step
    self.forward_fill_batch(new_batch)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 441, in forward_fill_batch
    ) = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 404, in forward
    return self.forward_extend_multi_modal(batch)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 393, in forward_extend_multi_modal
    return self.model.forward(
  File "/home/ubuntu/sglang/python/sglang/srt/models/llava.py", line 103, in forward
    input_embeds = self.language_model.model.embed_tokens(input_ids)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 105, in forward
    output = tensor_model_parallel_all_reduce(output_parallel)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/vllm/distributed/communication_op.py", line 39, in tensor_model_parallel_all_reduce
    torch.distributed.all_reduce(input_,
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 2 'out of memory'

Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 189, in exposed_step
    self.forward_step()
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 204, in forward_step
    self.forward_fill_batch(new_batch)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 441, in forward_fill_batch
    ) = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 404, in forward
    return self.forward_extend_multi_modal(batch)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 393, in forward_extend_multi_modal
    return self.model.forward(
  File "/home/ubuntu/sglang/python/sglang/srt/models/llava.py", line 103, in forward
    input_embeds = self.language_model.model.embed_tokens(input_ids)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 105, in forward
    output = tensor_model_parallel_all_reduce(output_parallel)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/vllm/distributed/communication_op.py", line 39, in tensor_model_parallel_all_reduce
    torch.distributed.all_reduce(input_,
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 2 'out of memory'

/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py:253: UserWarning: Warning: available_size=6568, max_total_num_tokens=6574
KV cache pool leak detected!
  warnings.warn(
/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py:253: UserWarning: Warning: available_size=6568, max_total_num_tokens=6574
KV cache pool leak detected!
  warnings.warn(
/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py:253: UserWarning: Warning: available_size=6568, max_total_num_tokens=6574
KV cache pool leak detected!
  warnings.warn(
/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py:253: UserWarning: Warning: available_size=6568, max_total_num_tokens=6574
KV cache pool leak detected!
  warnings.warn(
INFO:     172.16.0.42:22098 - "GET /health HTTP/1.1" 200 OK
INFO:     172.16.0.42:28046 - "GET /health HTTP/1.1" 200 OK

failure with no --mem-fraction-static option:

[rank=1] Init torch begin. Avail mem=78.81 GB
[rank=3] Init torch begin. Avail mem=78.81 GB
[rank=1] Init torch end.
[rank=0] Init torch end.
[rank=3] Init torch end.
[rank=2] Init torch end.
NCCL version 2.20.5+cuda12.4
[rank=0] Load weight begin.
[rank=2] Load weight begin.
[rank=3] Load weight begin.
[rank=1] Load weight begin.
INFO 05-27 22:30:04 weight_utils.py:199] Using model weights format ['*.safetensors']
INFO 05-27 22:30:04 weight_utils.py:199] Using model weights format ['*.safetensors']
INFO 05-27 22:30:04 weight_utils.py:199] Using model weights format ['*.safetensors']
INFO 05-27 22:30:04 weight_utils.py:199] Using model weights format ['*.safetensors']
[rank=3] Load weight end. Type=LlavaQwenForCausalLM. Avail mem=6.10 GB
[rank=1] Load weight end. Type=LlavaQwenForCausalLM. Avail mem=5.59 GB
[rank=0] Load weight end. Type=LlavaQwenForCausalLM. Avail mem=5.64 GB
[rank=2] Load weight end. Type=LlavaQwenForCausalLM. Avail mem=5.59 GB
Initialization failed. router_init_state: Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/manager.py", line 71, in start_router_process
    model_client = ModelRpcClient(server_args, port_args, model_overide_args)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 777, in __init__
    self.step = async_wrap("step")
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 768, in async_wrap
    fs = [rpyc.async_(getattr(m, func_name)) for m in self.model_servers]
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 768, in <listcomp>
    fs = [rpyc.async_(getattr(m, func_name)) for m in self.model_servers]
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 760, in init_model
    return self.remote_services[i].ModelRpcServer(
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/rpyc/core/netref.py", line 239, in __call__
    return syncreq(_self, consts.HANDLE_CALL, args, kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/rpyc/core/netref.py", line 63, in syncreq
    return conn.sync_request(handler, proxy, *args)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/rpyc/core/protocol.py", line 744, in sync_request
    return _async_res.value
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/rpyc/core/async_.py", line 111, in value
    raise self._obj
_get_exception_class.<locals>.Derived: Not enought memory. Please try to increase --mem-fraction-static.

========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/rpyc/core/protocol.py", line 369, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/rpyc/core/protocol.py", line 863, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 73, in __init__
    self.model_runner = ModelRunner(
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 257, in __init__
    self.init_memory_pool(total_gpu_memory)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 307, in init_memory_pool
    raise RuntimeError(
RuntimeError: Not enought memory. Please try to increase --mem-fraction-static.

Initialization failed. detoken_init_state: init ok
goodbye ('127.0.0.1', 57306)
goodbye ('127.0.0.1', 32778)
goodbye ('127.0.0.1', 51872)
goodbye ('127.0.0.1', 56908)
pseudotensor commented 5 months ago

With --context-length=4096 I get the same problem. That can't be right.

Qubitium commented 5 months ago

Please check again with my PR (https://github.com/sgl-project/sglang/pull/487) and vllm 0.4.3 to see if the issue is resolved. It may have been fixed here and/or in vllm since your last report. I have tested multi-GPU loading and did not see an obvious regression in VRAM usage, though under a different environment and with a different model/GPU.
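Something like the following should work for retesting (untested sketch on my end; pull/487/head is GitHub's standard read-only ref for a PR, and pr-487 is just a local branch name I made up):

cd ~/sglang
git fetch origin pull/487/head:pr-487   # fetch PR #487 into a local branch
git checkout pr-487
pip install "vllm==0.4.3"
cd python
pip install -e ".[all]"

export CUDA_VISIBLE_DEVICES="3,4,5,6"
python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --tokenizer-path lmms-lab/llavanext-qwen-tokenizer --port=30010 --host="0.0.0.0" --tp-size=4 --random-seed=1234 --context-length=32768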

github-actions[bot] commented 3 months ago

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

chuangzhidan commented 2 months ago


Please check again with my PR (#487) and vllm 0.4.3 to see if the issue is resolved. It may have been fixed here and/or in vllm since your last report. I have tested multi-GPU loading and did not see an obvious regression in VRAM usage, though under a different environment and with a different model/GPU.

I cannot even load an int3 Qwen-72B model with 50 GB available, even though it usually takes only about 37 GB of memory with vllm. It sucks.