
OOM CUDA error on 8 * L4 machine when launching sglang server #445

Closed mounamokaddem closed 2 months ago

mounamokaddem commented 4 months ago

Hey!

I'm trying to launch an SGLang server with OpenBioLLM 70B using the command `python -m sglang.launch_server --model-path ~/Llama3-OpenBioLLM-70B-Instruct --port 30000`, but I ran into these 2 issues:

  1. It errors out with a CUDA OOM. I tried playing with all the memory-related arguments but still hit it; for example, running `python -m sglang.launch_server --model-path ~/Llama3-OpenBioLLM-70B-Instruct --port 30000 --mem-fraction-static 0.9 --tp 8 --disable-disk-cache` errors out. I tried decreasing `--mem-fraction-static` and different `--tp` values, but it still fails. Here is the error:
    
    ```
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    server started on [0.0.0.0]:10014
    server started on [0.0.0.0]:10017
    server started on [0.0.0.0]:10018
    server started on [0.0.0.0]:10016
    server started on [0.0.0.0]:10019
    server started on [0.0.0.0]:10015
    server started on [0.0.0.0]:10020
    server started on [0.0.0.0]:10021
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    accepted ('127.0.0.1', 55860) with fd 52
    welcome ('127.0.0.1', 55860)
    accepted ('127.0.0.1', 54770) with fd 32
    welcome ('127.0.0.1', 54770)
    accepted ('127.0.0.1', 37120) with fd 33
    welcome ('127.0.0.1', 37120)
    accepted ('127.0.0.1', 38382) with fd 28
    welcome ('127.0.0.1', 38382)
    accepted ('127.0.0.1', 57702) with fd 29
    welcome ('127.0.0.1', 57702)
    accepted ('127.0.0.1', 55900) with fd 24
    welcome ('127.0.0.1', 55900)
    accepted ('127.0.0.1', 37206) with fd 24
    welcome ('127.0.0.1', 37206)
    accepted ('127.0.0.1', 47836) with fd 24
    welcome ('127.0.0.1', 47836)
    Rank 4: load weight begin.
    Rank 5: load weight begin.
    Rank 7: load weight begin.
    Rank 0: load weight begin.
    Rank 6: load weight begin.
    Rank 1: load weight begin.
    Rank 2: load weight begin.
    Rank 3: load weight begin.
    Initialization failed. router_init_state: Traceback (most recent call last):
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/manager.py", line 71, in start_router_process
    model_client = ModelRpcClient(server_args, port_args, model_overide_args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 724, in __init__
    self.step = async_wrap("step")
                ^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 715, in async_wrap
    fs = [rpyc.async_(getattr(m, func_name)) for m in self.model_servers]
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 715, in <listcomp>
    fs = [rpyc.async_(getattr(m, func_name)) for m in self.model_servers]
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 707, in init_model
    return self.remote_services[i].ModelRpcServer(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/netref.py", line 239, in __call__
    return syncreq(_self, consts.HANDLE_CALL, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/netref.py", line 63, in syncreq
    return conn.sync_request(handler, proxy, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/protocol.py", line 744, in sync_request
    return _async_res.value
           ^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/async_.py", line 111, in value
    raise self._obj
    rpyc.core.vinegar/torch.cuda._get_exception_class.<locals>.Derived: CUDA out of memory. Tried to allocate 112.00 MiB. GPU

    ========= Remote Traceback (1) =========
    Traceback (most recent call last):
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/protocol.py", line 369, in _dispatch_request
        res = self._HANDLERS[handler](self, *args)
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/protocol.py", line 863, in _handle_call
        return obj(*args, **dict(kwargs))
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 76, in __init__
        self.model_runner = ModelRunner(
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 285, in __init__
        self.load_model()
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 323, in load_model
        model = model_class(
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 257, in __init__
        self.model = LlamaModel(config, quant_config=quant_config)
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 217, in __init__
        [
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 218, in <listcomp>
        LlamaDecoderLayer(config, i, quant_config=quant_config)
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 166, in __init__
        self.mlp = LlamaMLP(
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 39, in __init__
        self.gate_up_proj = MergedColumnParallelLinear(
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 333, in __init__
        super().__init__(input_size, sum(output_sizes), bias, gather_output,
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 236, in __init__
        self.quant_method.create_weights(self,
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 81, in create_weights
        weight = Parameter(torch.empty(output_size_per_partition,
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/torch/utils/_device.py", line 78, in __torch_function__
        return func(*args, **kwargs)
    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU

    Initialization failed. detoken_init_state: init ok
    goodbye ('127.0.0.1', 57702)
    goodbye ('127.0.0.1', 37206)
    goodbye ('127.0.0.1', 55900)
    goodbye ('127.0.0.1', 47836)
    goodbye ('127.0.0.1', 54770)
    goodbye ('127.0.0.1', 37120)
    goodbye ('127.0.0.1', 38382)
    goodbye ('127.0.0.1', 55860)
    ```

2. It gets stuck during weight loading with the following command (I eventually interrupted it with Ctrl+C):

```
python -m sglang.launch_server --model-path ~/Llama3-OpenBioLLM-70B-Instruct --port 30000 --mem-fraction-static 0.9 --tp 8 --disable-disk-cache
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
server started on [0.0.0.0]:10007
server started on [0.0.0.0]:10004
server started on [0.0.0.0]:10005
server started on [0.0.0.0]:10008
server started on [0.0.0.0]:10006
server started on [0.0.0.0]:10009
server started on [0.0.0.0]:10010
server started on [0.0.0.0]:10011
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
accepted ('127.0.0.1', 44596) with fd 46
welcome ('127.0.0.1', 44596)
accepted ('127.0.0.1', 44648) with fd 33
welcome ('127.0.0.1', 44648)
accepted ('127.0.0.1', 53648) with fd 24
welcome ('127.0.0.1', 53648)
accepted ('127.0.0.1', 33128) with fd 25
welcome ('127.0.0.1', 33128)
accepted ('127.0.0.1', 41686) with fd 25
welcome ('127.0.0.1', 41686)
accepted ('127.0.0.1', 56570) with fd 25
welcome ('127.0.0.1', 56570)
accepted ('127.0.0.1', 48382) with fd 34
welcome ('127.0.0.1', 48382)
accepted ('127.0.0.1', 36272) with fd 29
welcome ('127.0.0.1', 36272)
Rank 4: load weight begin.
Rank 6: load weight begin.
Rank 2: load weight begin.
Rank 5: load weight begin.
Rank 3: load weight begin.
Rank 7: load weight begin.
Rank 1: load weight begin.
Rank 0: load weight begin.
^C
```


and when I then call `set_default_backend(RuntimeEndpoint("http://localhost:30000"))` from a client, it errors out with connection refused (see the port-check sketch after the traceback below):

```
ConnectionRefusedError                    Traceback (most recent call last)
File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/urllib/request.py:1348, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
   1347 try:
-> 1348     h.request(req.get_method(), req.selector, req.data, headers,
   1349               encode_chunked=req.has_header('Transfer-encoding'))
   1350 except OSError as err: # timeout error

File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1286, in HTTPConnection.request(self, method, url, body, headers, encode_chunked)
   1285 """Send a complete request to the server."""
-> 1286 self._send_request(method, url, body, headers, encode_chunked)

File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1332, in HTTPConnection._send_request(self, method, url, body, headers, encode_chunked)
   1331 body = _encode(body, 'body')
-> 1332 self.endheaders(body, encode_chunked=encode_chunked)

File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1281, in HTTPConnection.endheaders(self, message_body, encode_chunked)
   1280 raise CannotSendHeader()
-> 1281 self._send_output(message_body, encode_chunked=encode_chunked)

File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1041, in HTTPConnection._send_output(self, message_body, encode_chunked)
   1040 del self._buffer[:]
-> 1041 self.send(msg)
   1043 if message_body is not None:
   1044 ...
-> 1351 raise URLError(err)
   1352 r = h.getresponse()
   1353 except:

URLError: <urlopen error [Errno 111] Connection refused>
```
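The connection refusal itself is just a symptom: the client is called before the server ever finishes starting. Once the launch problem is fixed, a minimal sketch like the following (my own addition; the host, port, timeout, and polling interval are assumptions, not values from this setup) could be used to wait for the port to accept connections before calling `set_default_backend`:

```python
import socket
import time

import sglang as sgl


def wait_for_port(host: str = "localhost", port: int = 30000,
                  timeout_s: float = 600, interval_s: float = 5) -> bool:
    """Poll a TCP port until it accepts connections or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True  # server is up and listening
        except OSError:
            time.sleep(interval_s)  # still loading weights (or it crashed)
    return False


if wait_for_port():
    sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
else:
    raise RuntimeError("sglang server never became reachable on port 30000")
```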


**Setup**
Machine type: g2-standard-96
GPUs: 8 x NVIDIA L4
Architecture: x86_64

sglang version: [v0.1.16](https://github.com/sgl-project/sglang/releases/tag/v0.1.16)

It doesn't look like a memory problem: the machine has 192 GB of GPU memory in total (24 GB per GPU), and inference works when I run it without sglang. Also, I haven't tried [flashinfer](https://github.com/sgl-project/sglang/blob/main/docs/flashinfer.md) yet, since it is meant to accelerate inference, which isn't my problem right now.
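For reference, this is the quick check I would run right before launching to confirm that no other process is already holding GPU memory (a small sketch of my own, assuming `torch` is installed in the same environment):

```python
import torch

# Print free/total memory for each visible GPU right before launching the server,
# to confirm nothing else is already occupying the cards.
for i in range(torch.cuda.device_count()):
    free_bytes, total_bytes = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free_bytes / 2**30:.1f} GiB free / {total_bytes / 2**30:.1f} GiB total")
```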
hnyls2002 commented 4 months ago

@mounamokaddem Try decreasing `--mem-fraction-static`; sglang needs more free space for allocation when the tensor parallelism size is large.
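For a rough sense of why the budget is tight on 24 GB L4s even with tp=8, here is a back-of-envelope estimate (my own illustrative numbers, not measurements from this setup):

```python
# Back-of-envelope per-GPU memory budget for a 70B model on 8 x 24 GB L4s with tp=8.
# All figures are rough assumptions for illustration only.
params = 70e9
bytes_per_param = 2                     # fp16/bf16 weights
gpus = 8
gpu_mem_gib = 24                        # nominal; usable memory is somewhat lower

weights_per_gpu_gib = params * bytes_per_param / gpus / 2**30   # ~16.3 GiB
headroom_gib = gpu_mem_gib - weights_per_gpu_gib                # ~7.7 GiB

print(f"weights per GPU: ~{weights_per_gpu_gib:.1f} GiB")
print(f"left for KV cache, activations, CUDA/NCCL overhead: ~{headroom_gib:.1f} GiB")
```

Each tensor-parallel rank also pays for its own CUDA context and communication buffers on top of its weight shard, so the usable headroom is smaller than the raw numbers suggest.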

mounamokaddem commented 4 months ago

@hnyls2002 I tried everything. As mentioned above, I played with all the combinations of values; for `--mem-fraction-static` I tried values from 0.1 to 0.9, with and without tensor parallelism, but it didn't work.

github-actions[bot] commented 2 months ago

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.