sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sglang.readthedocs.io/en/latest/

[Bug] schedule_batch.py: IndexError: list index out of range #1189

Open Quang-elec44 opened 3 weeks ago

Quang-elec44 commented 3 weeks ago

Describe the bug

Exception in ModelTpServer:
Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 218, in exposed_step
    self.forward_step()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 247, in forward_step
    self.forward_decode_batch(self.running_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 595, in forward_decode_batch
    jump_forward_reqs = batch.check_for_jump_forward(self.model_runner)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/schedule_batch.py", line 599, in check_for_jump_forward
    if not req.jump_forward_and_retokenize(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/schedule_batch.py", line 269, in jump_forward_and_retokenize
    if all_ids[prompt_tokens - 1] != self.origin_input_ids_unpadded[-1]:
IndexError: list index out of range

While a QA/QC team member was testing the model, the server hit this error. Unfortunately, I cannot reproduce the bug, but I hope you can guess the cause.
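If it helps with guessing: the failing line indexes all_ids[prompt_tokens - 1], so at that point len(all_ids) must be smaller than the original prompt length. Below is a minimal sketch of that hypothesis, assuming retokenization after a jump-forward can merge tokens across the prompt/output boundary; the function name boundary_check and all token values are made up for illustration.

```python
# Minimal sketch of the suspected failure, not the real fix. Names mirror
# jump_forward_and_retokenize in schedule_batch.py; everything else is
# simplified and hypothetical. The guess: retokenizing the jumped-forward
# text can merge tokens across the prompt/output boundary, so all_ids ends
# up shorter than the original prompt and all_ids[prompt_tokens - 1] overflows.

def boundary_check(all_ids: list[int], origin_input_ids_unpadded: list[int]) -> bool:
    prompt_tokens = len(origin_input_ids_unpadded)
    if len(all_ids) < prompt_tokens:
        # The current code has no guard for this case, so indexing
        # all_ids[prompt_tokens - 1] would raise IndexError here.
        # Treat it as "retokenization changed the prompt".
        return False
    return all_ids[prompt_tokens - 1] == origin_input_ids_unpadded[-1]

# Toy trigger: 5 prompt tokens, but retokenization produced only 4 ids
# because two tokens merged at the boundary.
print(boundary_check([1, 2, 3, 9], [1, 2, 3, 4, 5]))  # -> False instead of IndexError
```

If that hypothesis holds, a length guard before the comparison would turn the crash into the normal "retokenization changed the prompt" failure path.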

Reproduction

MODEL=hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
TP=1
MFS=0.6

docker run -d --gpus '"device=1"' \
  -p 8002:8002 \
  --rm \
  --network=$NETWORK \
  --name sglang-server \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env HF_TOKEN=$HF_TOKEN \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path $MODEL \
    --host 0.0.0.0 \
    --port 8002 \
    --served-model-name $SERVED_MODEL_NAME \
    --mem-fraction-static $MFS \
    --tensor-parallel-size $TP \
    --api-key $API_KEY \
    --quantization awq_marlin \
    --efficient-weight-load \
    --enable-p2p-check
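For reproduction attempts: the crash is inside check_for_jump_forward, which as far as I can tell only does work for regex-constrained generation, so plain chat traffic probably cannot hit it. Here is a sketch of a request against the server above using sglang's native /generate endpoint; the prompt, the regex, and the API_KEY env var are placeholders, and I'm assuming the --api-key middleware accepts a Bearer token.

```python
# Hypothetical repro request, not a confirmed trigger. /generate and the
# "regex" sampling param are sglang's native API; everything else here
# (prompt text, regex pattern, API_KEY env var) is a placeholder.
import os

import requests

resp = requests.post(
    "http://localhost:8002/generate",
    headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
    json={
        "text": "The answer is",
        "sampling_params": {
            "max_new_tokens": 32,
            # A regex constraint routes the request through the jump-forward path.
            "regex": r"(yes|no), because [a-z ]+\.",
        },
    },
    timeout=60,
)
print(resp.json())
```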

Environment

Python: 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0]
CUDA available: True
GPU 0,1: NVIDIA A10G
GPU 0,1 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 550.90.07
PyTorch: 2.4.0+cu121
flashinfer: 0.1.5+cu121torch2.4
triton: 3.0.0
transformers: 4.44.0
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.3
fastapi: 0.112.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.5
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.1.0
vllm: 0.5.4
multipart: 0.0.9
openai: 1.40.8
anthropic: 0.34.0

NVIDIA Topology:
      GPU0  GPU1  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    PHB   0-47          0              N/A
GPU1  PHB    X    0-47          0              N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1048576

Quang-elec44 commented 3 weeks ago

The server crashed and hung after hitting this error.

merrymercy commented 2 weeks ago

cc @hnyls2002