sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
Apache License 2.0

Failure to start - P100/CUDA 6.1 incompatibility? #1059

Closed dirkson closed 2 months ago

dirkson commented 2 months ago

Attempting to run version 0.2.12 fails with the attached error. I tried a few flags, but none of them changed the error: `--disable-flashinfer`, `--disable-flashinfer-sampling`, `--disable-cuda-graph`. Given that the final line complains about a kernel image being unavailable, I suspect the code requires a particular CUDA version or particular GPUs, but I haven't been able to find that documented anywhere, and the error message itself is not exactly clear.

If I'm right about what the issue is, support for P100s/CUDA 6.1 would be appreciated as a feature request: they're by a large margin the most economical way to run large models right now. If P100s/CUDA 6.1 are already supposed to be supported, I'd be eager to help track down whatever bug is happening here.

It wasn't clear to me which tests would actually help, so I didn't dig into it very deeply. I'm happy to try building the software myself, or to install models beyond the one GPTQ model I happened to have downloaded. Just let me know what you need!

```
Exception in ControllerSingle:
Traceback (most recent call last):
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/sglang/srt/managers/controller_single.py", line 166, in start_controller_process
    controller.loop_for_forward()
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/sglang/srt/managers/controller_single.py", line 103, in loop_for_forward
    out_pyobjs = self.tp_server.exposed_step(recv_reqs)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/sglang/srt/managers/tp_worker.py", line 222, in exposed_step
    self.forward_step()
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/sglang/srt/managers/tp_worker.py", line 238, in forward_step
    self.forward_prefill_batch(new_batch)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/sglang/srt/managers/tp_worker.py", line 452, in forward_prefill_batch
    output = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 397, in forward
    return self.forward_extend(batch)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 373, in forward_extend
    return self.model.forward(
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 314, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 281, in forward
    hidden_states, residual = layer(
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 228, in forward
    hidden_states = self.input_layernorm(hidden_states)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/vllm/model_executor/custom_op.py", line 13, in forward
    return self._forward_method(*args, **kwargs)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/sglang/srt/layers/layernorm.py", line 45, in forward_cuda
    out = rmsnorm(x, self.weight.data, self.variance_epsilon)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/flashinfer/norm.py", line 52, in rmsnorm
    return _kernels.rmsnorm(input, weight, eps)
RuntimeError: RMSNorm failed with error code no kernel image is available for execution on the device
```
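
For what it's worth, the failure can presumably be reproduced outside of sglang with just the flashinfer call at the bottom of the traceback. A rough sketch, assuming flashinfer and a CUDA build of PyTorch are installed; the `rmsnorm` signature is taken from the traceback, and the shapes/dtype are arbitrary:

```python
import torch
from flashinfer.norm import rmsnorm  # same function as in the traceback

# Arbitrary hidden size and batch; chosen only for illustration.
x = torch.randn(4, 4096, dtype=torch.float16, device="cuda")
w = torch.ones(4096, dtype=torch.float16, device="cuda")

# On a GPU older than the architectures the flashinfer wheels are built for,
# this call should fail with the same "no kernel image is available for
# execution on the device" error as above.
out = rmsnorm(x, w, 1e-6)
print(out.shape)
```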

zhyncs commented 2 months ago

Sorry. We primarily focus on sm80+ and sm75.
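
For reference, a quick way to check what a given card reports (a minimal sketch, assuming a CUDA build of PyTorch; sm75 corresponds to compute capability (7, 5) and sm80 to (8, 0)):

```python
import torch

# Compute capability as a (major, minor) tuple, e.g. (7, 5) for Turing or
# (8, 0) for Ampere; Pascal cards report a major version of 6.
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: sm_{major}{minor}")
```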

dirkson commented 2 months ago

Please consider this a feature request, then. This hardware is currently the cheapest, most accessible way to run inference, so in my opinion it is the most important hardware to focus on. I'd also suggest documenting the current restriction so that you don't get duplicate feature requests.
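
As a rough illustration of the documentation/error-message side of this request, something like the following guard at startup would surface the restriction clearly instead of failing deep inside a kernel call. This is only a hypothetical sketch based on the sm75/sm80+ targets mentioned above, not how sglang currently behaves:

```python
import torch

def assert_supported_gpu(device: int = 0) -> None:
    # Hypothetical startup check; the threshold reflects the sm75 / sm80+
    # targets mentioned in this thread, not an official sglang policy.
    major, minor = torch.cuda.get_device_capability(device)
    if (major, minor) < (7, 5):
        raise RuntimeError(
            f"Detected compute capability sm_{major}{minor}; "
            "these kernels target sm75 and sm80+."
        )

assert_supported_gpu()
```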

zhyncs commented 2 months ago

That makes no sense to me.