sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

[Bug] cannot import name 'CachedGrammarCompiler' from 'xgrammar' (version 0.3.6) #2166

Closed: Quang-elec44 closed this issue 3 days ago

Quang-elec44 commented 4 days ago

Describe the bug

ImportError: cannot import name 'CachedGrammarCompiler' from 'xgrammar' (/usr/local/lib/python3.10/dist-packages/xgrammar/__init__.py)
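A quick way to confirm the mismatch is to check which xgrammar build is installed and whether it still exports the symbol sglang expects. This is a minimal sketch using only the standard library; `check_symbol` is a hypothetical helper, not part of either package:

```python
from importlib import metadata


def check_symbol(module_name: str, symbol: str):
    """Return (version, has_symbol) for an installed package.

    Hypothetical helper: reports whether `module_name` can be imported,
    its installed version, and whether it exposes `symbol` at top level.
    """
    try:
        mod = __import__(module_name)
    except ImportError:
        return None, False
    try:
        version = metadata.version(module_name)
    except metadata.PackageNotFoundError:
        # Fall back to the module's own version attribute, if any.
        version = getattr(mod, "__version__", None)
    return version, hasattr(mod, symbol)


# With xgrammar 0.3.6 installed, check_symbol("xgrammar", "CachedGrammarCompiler")
# should report the symbol missing, matching the ImportError above.
```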

Reproduction

services:
  llm-sglang-dev:
    image: lmsysorg/sglang:latest
    container_name: llm-sglang-dev
    restart: unless-stopped
    environment:
      HUGGING_FACE_HUB_TOKEN: <my-hf-token>
    ports:
      - "8007:8007"
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]
    ipc: host
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    env_file: 
      - .env
    command: >
      python3 -m sglang.launch_server
      --model Qwen/Qwen2.5-7B-Instruct-AWQ
      --host 0.0.0.0
      --port 8007
      --api-key <my-api-key>
      --served-model-name gpt-4o
      --tensor-parallel-size 1
      --mem-fraction-static 0.8
      --random-seed 42
      --enable-p2p-check
      --show-time-cost
      --quantization awq_marlin
      --enable-cache-report
      --grammar-backend xgrammar
      --context-length 4096
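As an aside, `image: lmsysorg/sglang:latest` makes the bundled dependency set a moving target, so the next pull can silently swap in an incompatible xgrammar build. A sketch of pinning the image instead, assuming a versioned tag is published on Docker Hub:

```yaml
services:
  llm-sglang-dev:
    # Pin to a released tag instead of `latest` so the bundled
    # xgrammar version stays fixed (the exact tag name is an assumption).
    image: lmsysorg/sglang:v0.3.6.post1
```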
import openai

from pydantic import BaseModel

client = openai.OpenAI(
    base_url="http://localhost:8007/v1",
    api_key="my-api-key"
)

class Players(BaseModel):
    names: list[str]

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Give me twenty football players' name and return the result as a JSON object with the key is `names`"}
    ],
    temperature=0.0,
    max_tokens=256,
    extra_body={
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": Players.__name__,
                "schema": Players.model_json_schema(),
            },
        },
    },
)
print(completion.choices[0].message.content)
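Since the request constrains decoding to the `Players` schema, the reply should be directly parseable; a small stdlib-only check can assert that before the output is used downstream. `parse_players` is a hypothetical helper, not part of sglang or the openai client:

```python
import json


def parse_players(raw: str) -> list[str]:
    """Validate that `raw` matches the shape {"names": [str, ...]}.

    Hypothetical validation helper for the response printed above.
    """
    data = json.loads(raw)
    names = data.get("names")
    if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
        raise ValueError("expected a JSON object like {'names': [str, ...]}")
    return names


# parse_players(completion.choices[0].message.content) would raise on any
# reply that drifts from the requested json_schema.
```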

Environment

Python: 3.10.15 (main, Sep  7 2024, 18:35:33) [GCC 9.4.0]
CUDA available: True
GPU 0: NVIDIA A10G
GPU 0 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 550.120
PyTorch: 2.5.1+cu124
flashinfer: 0.1.6+cu121torch2.4
triton: 3.1.0
transformers: 4.46.3
torchao: 0.6.1
numpy: 1.26.4
aiohttp: 3.11.7
fastapi: 0.115.5
hf_transfer: 0.1.8
huggingface_hub: 0.26.2
interegular: 0.3.3
psutil: 6.1.0
pydantic: 2.10.1
multipart: 0.0.17
zmq: 26.2.0
uvicorn: 0.32.1
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.55.0
anthropic: 0.39.0
NVIDIA Topology: 
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-47    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 1048576
merrymercy commented 3 days ago

Try v0.3.6.post1. Fixed by #2176