Closed: mounamokaddem closed this issue 2 months ago.
@mounamokaddem Try decreasing `--mem-fraction-static`, as sglang requires more free memory headroom for allocation when the tensor parallelism size is large.
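For rough intuition, here is a back-of-envelope sketch of the per-GPU weight footprint. The 70B parameter count and fp16 width are assumptions about this model, and real usage adds KV cache, activations, and per-rank CUDA/NCCL context on every GPU on top of this:

```python
def weights_per_gpu_gib(n_params: float, bytes_per_param: float, tp: int) -> float:
    """Approximate per-GPU weight memory when weights are sharded across tp ranks."""
    return n_params * bytes_per_param / tp / 2**30

# A 70B model in fp16 sharded over 8 GPUs needs on the order of 16 GiB
# per GPU for the weights alone, before KV cache and activations.
print(f"{weights_per_gpu_gib(70e9, 2, 8):.1f} GiB per GPU")
```

With `--mem-fraction-static 0.9`, nearly all of each card is reserved up front, so even a small extra allocation (like the 112 MiB in the traceback below) can push a tight GPU over the edge.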
@hnyls2002 I tried everything. As mentioned above, I experimented with all combinations of values for `--mem-fraction-static` (from 0.1 to 0.9), with and without tensor parallelism, but it still didn't work.
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
Hey!
I'm trying to launch an sglang server with OpenBioLLM 70B using the command

```
python -m sglang.launch_server --model-path ~/Llama3-OpenBioLLM-70B-Instruct --port 30000
```

but I ran into two issues.

First,

```
python -m sglang.launch_server --model-path ~/Llama3-OpenBioLLM-70B-Instruct --port 30000 --mem-fraction-static 0.9 --tp 8 --disable-disk-cache
```

errors out. I tried decreasing `--mem-fraction-static` and different values for `--tp`, but it still fails. Here is the error:

```
========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/protocol.py", line 369, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/protocol.py", line 863, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 76, in __init__
    self.model_runner = ModelRunner(
  File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 285, in __init__
    self.load_model()
  File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 323, in load_model
    model = model_class(
  File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 257, in __init__
    self.model = LlamaModel(config, quant_config=quant_config)
  File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 217, in __init__
    [
  File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 218, in <listcomp>
    LlamaDecoderLayer(config, i, quant_config=quant_config)
  File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 166, in __init__
    self.mlp = LlamaMLP(
  File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 39, in __init__
    self.gate_up_proj = MergedColumnParallelLinear(
  File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 333, in __init__
    super().__init__(input_size, sum(output_sizes), bias, gather_output,
  File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 236, in __init__
    self.quant_method.create_weights(self,
  File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 81, in create_weights
    weight = Parameter(torch.empty(output_size_per_partition,
  File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/torch/utils/_device.py", line 78, in __torch_function__
    return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU

Initialization failed.
detoken_init_state: init ok
goodbye ('127.0.0.1', 57702)
goodbye ('127.0.0.1', 37206)
goodbye ('127.0.0.1', 55900)
goodbye ('127.0.0.1', 47836)
goodbye ('127.0.0.1', 54770)
goodbye ('127.0.0.1', 37120)
goodbye ('127.0.0.1', 38382)
goodbye ('127.0.0.1', 55860)
```
Second, running

```
python -m sglang.launch_server --model-path ~/Llama3-OpenBioLLM-70B-Instruct --port 30000 --mem-fraction-static 0.9 --tp 8 --disable-disk-cache
```

produces the following output before I interrupted it:

```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
server started on [0.0.0.0]:10007
server started on [0.0.0.0]:10004
server started on [0.0.0.0]:10005
server started on [0.0.0.0]:10008
server started on [0.0.0.0]:10006
server started on [0.0.0.0]:10009
server started on [0.0.0.0]:10010
server started on [0.0.0.0]:10011
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
accepted ('127.0.0.1', 44596) with fd 46
welcome ('127.0.0.1', 44596)
accepted ('127.0.0.1', 44648) with fd 33
welcome ('127.0.0.1', 44648)
accepted ('127.0.0.1', 53648) with fd 24
welcome ('127.0.0.1', 53648)
accepted ('127.0.0.1', 33128) with fd 25
welcome ('127.0.0.1', 33128)
accepted ('127.0.0.1', 41686) with fd 25
welcome ('127.0.0.1', 41686)
accepted ('127.0.0.1', 56570) with fd 25
welcome ('127.0.0.1', 56570)
accepted ('127.0.0.1', 48382) with fd 34
welcome ('127.0.0.1', 48382)
accepted ('127.0.0.1', 36272) with fd 29
welcome ('127.0.0.1', 36272)
Rank 4: load weight begin.
Rank 6: load weight begin.
Rank 2: load weight begin.
Rank 5: load weight begin.
Rank 3: load weight begin.
Rank 7: load weight begin.
Rank 1: load weight begin.
Rank 0: load weight begin.
^C
```
The client then fails with:

```
ConnectionRefusedError                    Traceback (most recent call last)
File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/urllib/request.py:1348, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
   1347 try:
-> 1348     h.request(req.get_method(), req.selector, req.data, headers,
   1349         encode_chunked=req.has_header('Transfer-encoding'))
   1350 except OSError as err: # timeout error

File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1286, in HTTPConnection.request(self, method, url, body, headers, encode_chunked)
   1285 """Send a complete request to the server."""
-> 1286 self._send_request(method, url, body, headers, encode_chunked)

File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1332, in HTTPConnection._send_request(self, method, url, body, headers, encode_chunked)
   1331 body = _encode(body, 'body')
-> 1332 self.endheaders(body, encode_chunked=encode_chunked)

File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1281, in HTTPConnection.endheaders(self, message_body, encode_chunked)
   1280     raise CannotSendHeader()
-> 1281 self._send_output(message_body, encode_chunked=encode_chunked)

File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1041, in HTTPConnection._send_output(self, message_body, encode_chunked)
   1040 del self._buffer[:]
-> 1041 self.send(msg)
   1043 if message_body is not None:
   1044 ...
-> 1351     raise URLError(err)
   1352 r = h.getresponse()
   1353 except:

URLError: <urlopen error [Errno 111] Connection refused>
```
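Side note for anyone hitting the same symptom: `Connection refused` here just means no process is listening on the port, because the server died (or was interrupted) during initialization — the client error is a consequence, not the cause. A minimal stdlib sketch for gating requests on the server actually accepting connections (host and port are assumptions matching the command above):

```python
import socket
import time

def wait_for_server(host: str, port: int, timeout_s: float = 120.0) -> bool:
    """Poll until a TCP listener accepts connections on (host, port), or give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            time.sleep(1.0)
    return False

# Only start sending requests once the launch actually succeeded, e.g.:
# if wait_for_server("127.0.0.1", 30000):
#     ...issue requests...
# else:
#     ...inspect the server logs for an initialization failure...
```

This doesn't fix the OOM, but it separates "server never came up" from genuine request failures.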