
OOM CUDA error on 8 * L4 machine when launching sglang server #445

Closed mounamokaddem closed 2 months ago

mounamokaddem commented 4 months ago

Hey!

I'm trying to launch an SGLang server with OpenBioLLM 70B using the command `python -m sglang.launch_server --model-path ~/Llama3-OpenBioLLM-70B-Instruct --port 30000`, but I ran into these 2 issues:

  1. It errors out with a CUDA OOM. I tried playing with all the memory-related arguments but still hit it; for example, running `python -m sglang.launch_server --model-path ~/Llama3-OpenBioLLM-70B-Instruct --port 30000 --mem-fraction-static 0.9 --tp 8 --disable-disk-cache` errors out. I tried decreasing `--mem-fraction-static` and different `--tp` values, but it still fails. Here is the error:
    
    ```
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    server started on [0.0.0.0]:10014
    server started on [0.0.0.0]:10017
    server started on [0.0.0.0]:10018
    server started on [0.0.0.0]:10016
    server started on [0.0.0.0]:10019
    server started on [0.0.0.0]:10015
    server started on [0.0.0.0]:10020
    server started on [0.0.0.0]:10021
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    accepted ('127.0.0.1', 55860) with fd 52
    welcome ('127.0.0.1', 55860)
    accepted ('127.0.0.1', 54770) with fd 32
    welcome ('127.0.0.1', 54770)
    accepted ('127.0.0.1', 37120) with fd 33
    welcome ('127.0.0.1', 37120)
    accepted ('127.0.0.1', 38382) with fd 28
    welcome ('127.0.0.1', 38382)
    accepted ('127.0.0.1', 57702) with fd 29
    welcome ('127.0.0.1', 57702)
    accepted ('127.0.0.1', 55900) with fd 24
    welcome ('127.0.0.1', 55900)
    accepted ('127.0.0.1', 37206) with fd 24
    welcome ('127.0.0.1', 37206)
    accepted ('127.0.0.1', 47836) with fd 24
    welcome ('127.0.0.1', 47836)
    Rank 4: load weight begin.
    Rank 5: load weight begin.
    Rank 7: load weight begin.
    Rank 0: load weight begin.
    Rank 6: load weight begin.
    Rank 1: load weight begin.
    Rank 2: load weight begin.
    Rank 3: load weight begin.
    Initialization failed. router_init_state: Traceback (most recent call last):
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/manager.py", line 71, in start_router_process
    model_client = ModelRpcClient(server_args, port_args, model_overide_args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 724, in __init__
    self.step = async_wrap("step")
                ^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 715, in async_wrap
    fs = [rpyc.async_(getattr(m, func_name)) for m in self.model_servers]
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 715, in <listcomp>
    fs = [rpyc.async_(getattr(m, func_name)) for m in self.model_servers]
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 707, in init_model
    return self.remote_services[i].ModelRpcServer(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/netref.py", line 239, in __call__
    return syncreq(_self, consts.HANDLE_CALL, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/netref.py", line 63, in syncreq
    return conn.sync_request(handler, proxy, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/protocol.py", line 744, in sync_request
    return _async_res.value
           ^^^^^^^^^^^^^^^^
    File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/async_.py", line 111, in value
    raise self._obj
    rpyc.core.vinegar/torch.cuda._get_exception_class.<locals>.Derived: CUDA out of memory. Tried to allocate 112.00 MiB. GPU

    ========= Remote Traceback (1) =========
    Traceback (most recent call last):
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/protocol.py", line 369, in _dispatch_request
        res = self._HANDLERS[handler](self, *args)
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/protocol.py", line 863, in _handle_call
        return obj(*args, **dict(kwargs))
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 76, in __init__
        self.model_runner = ModelRunner(
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 285, in __init__
        self.load_model()
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 323, in load_model
        model = model_class(
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 257, in __init__
        self.model = LlamaModel(config, quant_config=quant_config)
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 217, in __init__
        [
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 218, in <listcomp>
        LlamaDecoderLayer(config, i, quant_config=quant_config)
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 166, in __init__
        self.mlp = LlamaMLP(
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 39, in __init__
        self.gate_up_proj = MergedColumnParallelLinear(
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 333, in __init__
        super().__init__(input_size, sum(output_sizes), bias, gather_output,
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 236, in __init__
        self.quant_method.create_weights(self,
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 81, in create_weights
        weight = Parameter(torch.empty(output_size_per_partition,
      File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/torch/utils/_device.py", line 78, in __torch_function__
        return func(*args, **kwargs)
    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU

    Initialization failed. detoken_init_state: init ok
    goodbye ('127.0.0.1', 57702)
    goodbye ('127.0.0.1', 37206)
    goodbye ('127.0.0.1', 55900)
    goodbye ('127.0.0.1', 47836)
    goodbye ('127.0.0.1', 54770)
    goodbye ('127.0.0.1', 37120)
    goodbye ('127.0.0.1', 38382)
    goodbye ('127.0.0.1', 55860)
    ```

2. It gets stuck during weight loading with the following command (I eventually interrupted it with Ctrl+C):

```
python -m sglang.launch_server --model-path ~/Llama3-OpenBioLLM-70B-Instruct --port 30000 --mem-fraction-static 0.9 --tp 8 --disable-disk-cache
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
server started on [0.0.0.0]:10007
server started on [0.0.0.0]:10004
server started on [0.0.0.0]:10005
server started on [0.0.0.0]:10008
server started on [0.0.0.0]:10006
server started on [0.0.0.0]:10009
server started on [0.0.0.0]:10010
server started on [0.0.0.0]:10011
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
accepted ('127.0.0.1', 44596) with fd 46
welcome ('127.0.0.1', 44596)
accepted ('127.0.0.1', 44648) with fd 33
welcome ('127.0.0.1', 44648)
accepted ('127.0.0.1', 53648) with fd 24
welcome ('127.0.0.1', 53648)
accepted ('127.0.0.1', 33128) with fd 25
welcome ('127.0.0.1', 33128)
accepted ('127.0.0.1', 41686) with fd 25
welcome ('127.0.0.1', 41686)
accepted ('127.0.0.1', 56570) with fd 25
welcome ('127.0.0.1', 56570)
accepted ('127.0.0.1', 48382) with fd 34
welcome ('127.0.0.1', 48382)
accepted ('127.0.0.1', 36272) with fd 29
welcome ('127.0.0.1', 36272)
Rank 4: load weight begin.
Rank 6: load weight begin.
Rank 2: load weight begin.
Rank 5: load weight begin.
Rank 3: load weight begin.
Rank 7: load weight begin.
Rank 1: load weight begin.
Rank 0: load weight begin.
^C
```


and when I then call `set_default_backend(RuntimeEndpoint("http://localhost:30000"))` from a client, it errors out with connection refused (see the port-check sketch after the traceback below):

```
ConnectionRefusedError                    Traceback (most recent call last)
File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/urllib/request.py:1348, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
   1347 try:
-> 1348     h.request(req.get_method(), req.selector, req.data, headers,
   1349               encode_chunked=req.has_header('Transfer-encoding'))
   1350 except OSError as err: # timeout error

File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1286, in HTTPConnection.request(self, method, url, body, headers, encode_chunked)
   1285 """Send a complete request to the server."""
-> 1286 self._send_request(method, url, body, headers, encode_chunked)

File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1332, in HTTPConnection._send_request(self, method, url, body, headers, encode_chunked)
   1331 body = _encode(body, 'body')
-> 1332 self.endheaders(body, encode_chunked=encode_chunked)

File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1281, in HTTPConnection.endheaders(self, message_body, encode_chunked)
   1280 raise CannotSendHeader()
-> 1281 self._send_output(message_body, encode_chunked=encode_chunked)

File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1041, in HTTPConnection._send_output(self, message_body, encode_chunked)
   1040 del self._buffer[:]
-> 1041 self.send(msg)
   1043 if message_body is not None:
   1044 ...
-> 1351 raise URLError(err)
   1352 r = h.getresponse()
   1353 except:

URLError: <urlopen error [Errno 111] Connection refused>
```
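The connection refusal itself is just a symptom: the client is called before the server ever finishes starting. Once the launch problem is fixed, a minimal sketch like the following (my own addition; the host, port, timeout, and polling interval are assumptions, not values from this setup) could be used to wait for the port to accept connections before calling `set_default_backend`:

```python
import socket
import time

import sglang as sgl


def wait_for_port(host: str = "localhost", port: int = 30000,
                  timeout_s: float = 600, interval_s: float = 5) -> bool:
    """Poll a TCP port until it accepts connections or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True  # server is up and listening
        except OSError:
            time.sleep(interval_s)  # still loading weights (or it crashed)
    return False


if wait_for_port():
    sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
else:
    raise RuntimeError("sglang server never became reachable on port 30000")
```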


**Setup**
Machine type: g2-standard-96
GPUs: 8 x NVIDIA L4
Architecture: x86_64

sglang version: [v0.1.16](https://github.com/sgl-project/sglang/releases/tag/v0.1.16)

It doesn't look like a memory problem: the machine has 192 GB of GPU memory in total (24 GB per GPU), and inference works when I run it without sglang. Also, I haven't tried [flashinfer](https://github.com/sgl-project/sglang/blob/main/docs/flashinfer.md) yet, since it is meant to accelerate inference, which isn't my problem right now.
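For reference, this is the quick check I would run right before launching to confirm that no other process is already holding GPU memory (a small sketch of my own, assuming `torch` is installed in the same environment):

```python
import torch

# Print free/total memory for each visible GPU right before launching the server,
# to confirm nothing else is already occupying the cards.
for i in range(torch.cuda.device_count()):
    free_bytes, total_bytes = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free_bytes / 2**30:.1f} GiB free / {total_bytes / 2**30:.1f} GiB total")
```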
hnyls2002 commented 4 months ago

@mounamokaddem Try decreasing `--mem-fraction-static`; sglang needs more free space for allocation when the tensor parallelism size is large.
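For a rough sense of why the budget is tight on 24 GB L4s even with tp=8, here is a back-of-envelope estimate (my own illustrative numbers, not measurements from this setup):

```python
# Back-of-envelope per-GPU memory budget for a 70B model on 8 x 24 GB L4s with tp=8.
# All figures are rough assumptions for illustration only.
params = 70e9
bytes_per_param = 2                     # fp16/bf16 weights
gpus = 8
gpu_mem_gib = 24                        # nominal; usable memory is somewhat lower

weights_per_gpu_gib = params * bytes_per_param / gpus / 2**30   # ~16.3 GiB
headroom_gib = gpu_mem_gib - weights_per_gpu_gib                # ~7.7 GiB

print(f"weights per GPU: ~{weights_per_gpu_gib:.1f} GiB")
print(f"left for KV cache, activations, CUDA/NCCL overhead: ~{headroom_gib:.1f} GiB")
```

Each tensor-parallel rank also pays for its own CUDA context and communication buffers on top of its weight shard, so the usable headroom is smaller than the raw numbers suggest.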

mounamokaddem commented 4 months ago

@hnyls2002 I tried everything. As mentioned above, I played with all the combinations of values; for `--mem-fraction-static` I tried values from 0.1 to 0.9, with and without tensor parallelism, but it didn't work.

github-actions[bot] commented 2 months ago

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.