sgl-project / sglang

SGLang is yet another fast serving framework for large language models and vision language models.
Apache License 2.0
2.9k stars 190 forks source link

../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [26,0,0] Assertion `srcIndex < srcSelectDimSize` failed. #473

Open pseudotensor opened 2 months ago

pseudotensor commented 2 months ago
python -m sglang.launch_server --model-path liuhaotian/llava-v1.6-34b --tokenizer-path liuhaotian/llava-v1.6-34b-tokenizer --port=30020 --host="0.0.0.0" --tp-size=1 --random-seed=1234 --context-length=4096 &> 34b.log &

client:

pload = {'text': '<|im_start|>system\nAnswer the questions.<|im_end|>user<image>\nGive detailed information.<|im_end|>', 'sampling_params': {'max_new_tokens': 1024, 'temperature': 0.0, 'top_p': 1.0, 'presence_penalty': 0.14000000000000012, 'frequency_penalty': 2, 'stop': ['<|im_end|>']}, 'stream': False}

The pload also has the image in bytes form, e.g.:

data:image/png;base64,iVBORw0KGgoAAAANSU...

For this image:

bigben

client code:

        response = requests.post(
            url,
            json=pload,
            stream=False,
        )

stream False or True doesn't help. url is just 'http://xxx.xxx.xxx.xxx:80/generate'.

The conv_chatml_direct was used to construct the above.

I get the same problem if I remove the sampling parameters.

This model works perfectly well on original llava worker-server-gradio setup, but has tons of issues with sglang. This includes no response or total failure on the server. This isn't just random issue, it happens repeatedly always and constantly, rare that things work.

Other models like llama3 work perfectly fine with same code.

The error:

INFO:     172.16.0.20:22926 - "GET /health HTTP/1.1" 200 OK
INFO:     172.16.0.20:18448 - "GET /health HTTP/1.1" 200 OK
INFO:     172.16.0.20:42702 - "GET /health HTTP/1.1" 200 OK
INFO:     172.16.0.20:1276 - "GET /health HTTP/1.1" 200 OK
INFO:     172.16.0.20:26522 - "GET /health HTTP/1.1" 200 OK
INFO:     172.16.0.20:19358 - "GET /health HTTP/1.1" 200 OK
new fill batch. #seq: 1. #cached_token: 9. #new_token: 8. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 49.92%.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [5,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [6,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [7,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [8,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [9,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [10,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [11,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [12,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [13,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [14,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [15,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [16,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [17,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [18,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [19,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [20,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [21,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [22,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [23,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [24,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [25,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [26,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [40,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [19,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [19,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
... etc.
Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 175, in exposed_step
    self.forward_step()
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 190, in forward_step
    self.forward_fill_batch(new_batch)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 418, in forward_fill_batch
    ) = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 404, in forward
    return self.forward_extend_multi_modal(batch)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 393, in forward_extend_multi_modal
    return self.model.forward(
  File "/home/ubuntu/sglang/python/sglang/srt/models/llava.py", line 106, in forward
    .cpu()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

etc.  this repeats forever and server is dead.

Related? https://github.com/sgl-project/sglang/issues/461

pseudotensor commented 2 months ago

This is using latest release on pypi or built off main.

pseudotensor commented 1 month ago

Any hope here? It constantly ends up bombing.

dmilcevski commented 1 month ago

I get the same error using version v0.1.17 installed from source and from pipe.