sgl-project / sglang

SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
Apache License 2.0

Llava CUDA error: device-side assert triggered #543

Open dmilcevski opened 2 weeks ago

dmilcevski commented 2 weeks ago

I am trying to deploy llava-v1.6-34b on A100 80GB but am getting the following error:

../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [395,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [395,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
2024-06-13 20:58:40 | ERROR | srt.tp_worker | Exception in ModelTpServer:
Traceback (most recent call last):
  File "/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 188, in exposed_step
    self.forward_step()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 204, in forward_step
    self.forward_fill_batch(new_batch)
  File "/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 443, in forward_fill_batch
    ) = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/sglang/python/sglang/srt/managers/controller/model_runner.py", line 422, in forward
    return self.forward_extend_multi_modal(batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/sglang/python/sglang/srt/managers/controller/model_runner.py", line 411, in forward_extend_multi_modal
    return self.model.forward(
  File "/sglang/python/sglang/srt/models/llava.py", line 110, in forward
    .cpu()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

2024-06-13 20:58:40 | ERROR | srt.controller | Exception in ControllerSingle:
Traceback (most recent call last):
  File "/sglang/python/sglang/srt/managers/controller/manager_single.py", line 93, in start_controller_process
    loop.run_until_complete(controller.loop_for_forward())
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/sglang/python/sglang/srt/managers/controller/manager_single.py", line 44, in loop_for_forward
    out_pyobjs = await self.model_client.step(next_step_input)
  File "/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 753, in _func
    return f(*args, **kwargs)
  File "/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 188, in exposed_step
    self.forward_step()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 204, in forward_step
    self.forward_fill_batch(new_batch)
  File "/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 443, in forward_fill_batch
    ) = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/sglang/python/sglang/srt/managers/controller/model_runner.py", line 422, in forward
    return self.forward_extend_multi_modal(batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/sglang/python/sglang/srt/managers/controller/model_runner.py", line 411, in forward_extend_multi_modal
    return self.model.forward(
  File "/sglang/python/sglang/srt/models/llava.py", line 110, in forward
    .cpu()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
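As the error text itself notes, device-side asserts are reported asynchronously, so the Python traceback can point at an unrelated call (here, the `.cpu()` in llava.py) rather than the kernel that failed. A minimal sketch of the synchronous-launch debugging step the message suggests (start the server afterwards in the same shell; the actual launch command is omitted here):

```shell
# Make CUDA kernel launches synchronous so the Python traceback points
# at the op that actually failed, not a later API call like .cpu().
export CUDA_LAUNCH_BLOCKING=1
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"
# ...then launch sglang as usual from this same shell.
```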

Does anybody have an idea how to fix the issue? Thanks

dmilcevski commented 2 weeks ago

There were many hanging processes, so I killed them and redeployed sglang. However, now I get a different issue, again coming from the llava implementation:

2024-06-14 08:36:05 | ERROR | srt.tp_worker | Exception in ModelTpServer:
Traceback (most recent call last):
  File "/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 188, in exposed_step
    self.forward_step()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 204, in forward_step
    self.forward_fill_batch(new_batch)
  File "/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 443, in forward_fill_batch
    ) = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/sglang/python/sglang/srt/managers/controller/model_runner.py", line 422, in forward
    return self.forward_extend_multi_modal(batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/sglang/python/sglang/srt/managers/controller/model_runner.py", line 411, in forward_extend_multi_modal
    return self.model.forward(
  File "/sglang/python/sglang/srt/models/llava.py", line 105, in forward
    input_embeds = self.language_model.model.embed_tokens(input_ids)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 100, in forward
    output_parallel = F.embedding(masked_input, self.weight)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2264, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

2024-06-14 08:36:05 | ERROR | srt.controller | Exception in ControllerSingle:
Traceback (most recent call last):
  File "/sglang/python/sglang/srt/managers/controller/manager_single.py", line 93, in start_controller_process
    loop.run_until_complete(controller.loop_for_forward())
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/sglang/python/sglang/srt/managers/controller/manager_single.py", line 44, in loop_for_forward
    out_pyobjs = await self.model_client.step(next_step_input)
  File "/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 753, in _func
    return f(*args, **kwargs)
  File "/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 188, in exposed_step
    self.forward_step()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 204, in forward_step
    self.forward_fill_batch(new_batch)
  File "/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 443, in forward_fill_batch
    ) = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/sglang/python/sglang/srt/managers/controller/model_runner.py", line 422, in forward
    return self.forward_extend_multi_modal(batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/sglang/python/sglang/srt/managers/controller/model_runner.py", line 411, in forward_extend_multi_modal
    return self.model.forward(
  File "/sglang/python/sglang/srt/models/llava.py", line 105, in forward
    input_embeds = self.language_model.model.embed_tokens(input_ids)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 100, in forward
    output_parallel = F.embedding(masked_input, self.weight)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2264, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
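Both tracebacks now fail inside the embedding lookup (`embed_tokens` → `F.embedding`), and the earlier `srcIndex < srcSelectDimSize` assert is the classic symptom of the same root cause: an index (e.g. a token id, perhaps an image-placeholder id) at or above the embedding table's size. A dependency-free sketch of that failure mode, with made-up names and sizes:

```python
# Embedding lookup is a row gather: one table row per token id.  A token
# id at or above the table size is what trips CUDA's
# "srcIndex < srcSelectDimSize" device-side assert; on CPU the same
# mistake surfaces immediately as a clear IndexError.
VOCAB_SIZE = 4  # toy vocab size for illustration
table = [[0.0] * 8 for _ in range(VOCAB_SIZE)]  # 4 rows, embedding dim 8

def embed(token_ids):
    rows = []
    for tid in token_ids:
        if not 0 <= tid < VOCAB_SIZE:
            raise IndexError(f"token id {tid} out of range [0, {VOCAB_SIZE})")
        rows.append(table[tid])
    return rows

try:
    embed([0, 1, 10])  # 10 >= VOCAB_SIZE, like an oversized special-token id
except IndexError as e:
    print(e)
```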

Any ideas on how to fix this? Thanks
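For completeness, the hanging processes mentioned earlier can be located and cleared before redeploying; the process name pattern below is an assumption, adjust it to your setup:

```shell
# The [s] trick stops grep from matching its own command line.
ps aux | grep '[s]glang' || echo "no hanging sglang processes found"
# Once identified, terminate them (commented out here for safety):
# pkill -f sglang
```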

taroshi commented 1 week ago

Using one GPU card works fine, but with two GPUs I hit the same problem.

dmilcevski commented 1 week ago

I explicitly restricted access to one GPU with CUDA_VISIBLE_DEVICES=0. There are more GPUs on the node, but sglang should only use this device, and the logs confirm that a single device is used:

2024-06-12 08:03:55 | INFO | srt.model_runner | [gpu_id=0] Set cuda device.
2024-06-12 08:03:55 | INFO | srt.model_runner | [gpu_id=0] Init nccl begin.
2024-06-12 08:03:56 | INFO | srt.model_runner | [gpu_id=0] Load weight begin. avail mem=78.59 GB
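A sketch of that single-device restriction; the launch command is an assumption based on sglang's standard `launch_server` entry point and the model path should be adapted:

```shell
# Expose only GPU 0 to the process; frameworks then enumerate a single
# device, which matches the [gpu_id=0] lines in the log above.
export CUDA_VISIBLE_DEVICES=0
echo "visible devices: $CUDA_VISIBLE_DEVICES"
# python -m sglang.launch_server --model-path <path-to-llava-v1.6-34b>
```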