sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/

[Bug] cutlass group_gemm.initialize failed #1788

Open · senlice opened this issue 2 weeks ago

senlice commented 2 weeks ago


Describe the bug

I fine-tuned Qwen1.5-7B with LoRA. After starting the server with the command below, a POST request fails with the following error:

```
return self._wrapper.run(
RuntimeError: cutlass group_gemm.initialize failed: Error Internal
cache[rtype].remove(name)
KeyError: '/mp-_k2l7pal'
```

Reproduction

Service startup command:

```
python -m sglang.launch_server --model Qwen/Qwen1.5-7B-Chat --lora-paths /output_qwen_lora --disable-radix --disable-cuda-graph --max-loras-per-batch 4
```

POST request script:

```python
import json

import requests

url = "http://127.0.0.1:30000"
json_data = {
    "text": [
        "Tell us about yourself",
        "What you're good at",
    ],
    "sampling_params": {
        "temperature": 0.8,
        "top_p": 1,
        "repetition_penalty": 1.05,
        "max_new_tokens": 200,
    },
    "lora_path": [
        "/output_qwen_lora",
        None,
    ],
}

response = requests.post(url + "/generate", json=json_data)
print(json.dumps(response.json()))
```

Environment

```
aioflask 0.4.0
aiohappyeyeballs 2.4.3
aiohttp 3.10.9
aiosignal 1.3.1
annotated-types 0.7.0
anthropic 0.36.0
anyio 4.6.0
async-timeout 4.0.3
attrs 24.2.0
audioread 3.0.1
blinker 1.8.2
certifi 2024.8.30
cffi 1.17.1
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
compressed-tensors 0.6.0
datasets 3.0.1
decorator 5.1.1
decord 0.6.0
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
einops 0.8.0
et-xmlfile 1.1.0
exceptiongroup 1.2.2
fastapi 0.115.0
filelock 3.16.1
flashinfer 0.1.6+cu121torch2.4
Flask 2.1.3
frozenlist 1.4.1
fsspec 2024.6.1
gguf 0.10.0
greenlet 3.1.1
greenletio 0.11.0
h11 0.14.0
hf_transfer 0.1.8
httpcore 1.0.6
httptools 0.6.1
httpx 0.27.2
huggingface-hub 0.25.1
idna 3.10
importlib_metadata 8.5.0
interegular 0.3.3
itsdangerous 2.2.0
jieba 0.42.1
Jinja2 3.1.4
jiter 0.6.1
joblib 1.4.2
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
lark 1.2.2
lazy_loader 0.4
librosa 0.10.2.post1
litellm 1.48.19
llvmlite 0.43.0
lm-format-enforcer 0.10.6
loguru 0.7.2
MarkupSafe 3.0.1
mistral_common 1.4.4
modelscope 1.18.1
mpmath 1.3.0
msgpack 1.1.0
msgspec 0.18.6
multidict 6.1.0
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.3
numba 0.60.0
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.560.30
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.77
nvidia-nvtx-cu12 12.1.105
openai 1.51.2
opencv-python-headless 4.10.0.84
openpyxl 3.1.5
orjson 3.10.10
outlines 0.0.46
packaging 24.1
pandas 2.2.3
partial-json-parser 0.2.1.1.post4
pillow 10.4.0
pip 24.2
platformdirs 4.3.6
pooch 1.8.2
prometheus_client 0.21.0
prometheus-fastapi-instrumentator 7.0.0
propcache 0.2.0
protobuf 5.28.2
psutil 6.0.0
py-cpuinfo 9.0.0
pyairports 2.1.1
pyarrow 17.0.0
pycountry 24.6.1
pycparser 2.22
pydantic 2.9.2
pydantic_core 2.23.4
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-multipart 0.0.12
pytz 2024.2
PyYAML 6.0.2
pyzmq 26.2.0
ray 2.37.0
referencing 0.35.1
regex 2024.9.11
requests 2.32.3
rpds-py 0.20.0
safetensors 0.4.5
scikit-learn 1.5.2
scipy 1.14.1
sentencepiece 0.2.0
setuptools 75.1.0
sglang 0.3.4.post1
six 1.16.0
sniffio 1.3.1
soundfile 0.12.1
soxr 0.5.0.post1
starlette 0.38.6
sympy 1.13.3
threadpoolctl 3.5.0
tiktoken 0.7.0
tokenizers 0.20.0
torch 2.4.0
torchao 0.5.0
torchvision 0.19.0
tqdm 4.66.5
transformers 4.45.2
triton 3.0.0
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
uvicorn 0.31.0
uvloop 0.20.0
vllm 0.6.3.post1
vllm-flash-attn 2.6.1
watchfiles 0.24.0
websockets 13.1
Werkzeug 2.2.2
wheel 0.44.0
xformers 0.0.27.post2
xxhash 3.5.0
yarl 1.14.0
zipp 3.20.2
zmq 0.0.0
```

trevor-m commented 1 week ago

Hi @senlice, I was able to reproduce your issue on an A6000 GPU. What GPU are you using?

I took a look at the code for cutlass grouped_gemm.initialize(). It seems the only way it can fail with Error Internal is if the kernel needs more shared memory than is available on your GPU. Relevant code here: https://github.com/NVIDIA/cutlass/blob/19f51596e8be9fe87d583616466581ab5740c19d/include/cutlass/gemm/device/base_grouped.h#L375-L384.
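For anyone hitting this, here is a quick way to see the shared-memory ceiling on your own card. This is a minimal sketch of mine, not part of sglang or cutlass; the per-architecture opt-in limits are taken from the CUDA C++ Programming Guide.

```python
import torch

# Opt-in shared memory per thread block (KB) by compute capability, per the
# CUDA C++ Programming Guide. This lookup table is illustrative, not an API.
OPTIN_SMEM_PER_BLOCK_KB = {
    (7, 0): 96,   # V100
    (7, 5): 64,   # T4, RTX 20xx
    (8, 0): 163,  # A100
    (8, 6): 99,   # A6000, RTX 30xx
    (8, 9): 99,   # L40, RTX 40xx
    (9, 0): 227,  # H100
}

props = torch.cuda.get_device_properties(0)
cc = (props.major, props.minor)
limit = OPTIN_SMEM_PER_BLOCK_KB.get(cc, "unknown")
print(f"{props.name} (sm_{props.major}{props.minor}): "
      f"opt-in shared memory per block = {limit} KB")
```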

The amount of shared memory needed is mostly determined by the ThreadblockShape in cutlass, which is currently set by flashinfer here: https://github.com/flashinfer-ai/flashinfer/blob/4f40420e24d65cabd8be731e12f96a5ef0795a4b/include/flashinfer/gemm/group_gemm.cuh#L83. I suspect this shape results in a shared memory footprint that works fine on an A100 or H100 but is too large for your GPU or mine; a rough estimate is sketched below. The fix probably belongs in flashinfer: this parameter should be chosen based on which GPU is being used.
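To make the shared-memory arithmetic concrete, here is a back-of-the-envelope estimate. The 128x128x64 tile, fp16 operands, and 4-stage pipeline are illustrative assumptions on my part, not flashinfer's confirmed configuration.

```python
# Rough shared-memory footprint of a CUTLASS multistage GEMM mainloop:
# each pipeline stage buffers one m*k tile of A and one k*n tile of B.
def mainloop_smem_bytes(m: int, n: int, k: int, stages: int,
                        elem_bytes: int = 2) -> int:  # fp16 = 2 bytes
    return stages * (m * k + k * n) * elem_bytes

need = mainloop_smem_bytes(m=128, n=128, k=64, stages=4)
print(f"{need / 1024:.0f} KB per threadblock")  # -> 128 KB
# 128 KB fits the 163 KB opt-in limit on an A100 (sm_80) and the 227 KB on
# an H100 (sm_90), but exceeds the 99 KB limit on an A6000 (sm_86), which
# would make group_gemm.initialize() fail with Error Internal there.
```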