sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/

[Bug] Running the llava 1.5 backend gives an error #1985

Closed · pspdada closed this 14 hours ago

pspdada commented 4 days ago

Describe the bug

I'm using the latest version of sglang. I ran the example in ./benchmark/llava_bench/README.md, but it resulted in the following error:

[2024-11-10 22:07:08] server_args=ServerArgs(model_path='liuhaotian/llava-v1.6-vicuna-7b', tokenizer_path='llava-hf/llava-1.5-7b-hf', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='liuhaotian/llava-v1.6-vicuna-7b', chat_template=None, is_embedding=False, host='127.0.0.1', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=8309160, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1)
[2024-11-10 22:07:18 TP0] Automatically turn off --chunked-prefill-size and adjust --mem-fraction-static for multimodal models.
[2024-11-10 22:07:18 TP0] Init torch distributed begin.
[2024-11-10 22:07:18 TP0] Load weight begin. avail mem=38.97 GB
[2024-11-10 22:07:20 TP0] lm_eval is not installed, GPTQ may not be usable
[2024-11-10 22:07:20 TP0] Ignore import error when loading sglang.srt.models.llava. Failed to import transformers.models.clip.modeling_clip because of the following error (look up to see its traceback):
/root/anaconda3/envs/llava/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
[2024-11-10 22:07:20 TP0] Ignore import error when loading sglang.srt.models.llavavid. Failed to import transformers.models.clip.modeling_clip because of the following error (look up to see its traceback):
/root/anaconda3/envs/llava/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
[2024-11-10 22:07:20 TP0] Ignore import error when loading sglang.srt.models.yivl. Failed to import transformers.models.clip.modeling_clip because of the following error (look up to see its traceback):
/root/anaconda3/envs/llava/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
[2024-11-10 22:07:20 TP0] Traceback (most recent call last):
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1191, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 163, in __init__
    self.tp_worker = TpWorkerClass(
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 55, in __init__
    self.model_runner = ModelRunner(
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 149, in __init__
    self.load_model()
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 253, in load_model
    self.model = get_model(
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 398, in load_model
    model = _initialize_model(model_config, self.load_config,
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 173, in _initialize_model
    model_class, _ = get_model_architecture(model_config)
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 35, in get_model_architecture
    return ModelRegistry.resolve_model_cls(architectures)
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/vllm/model_executor/models/registry.py", line 365, in resolve_model_cls
    model_cls = self._try_load_model_cls(arch)
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 664, in load_model_cls_srt
    raise ValueError(
ValueError: Unsupported architectures: LlavaLlamaForCausalLM. Supported list: ['BaichuanForCausalLM', 'ChatGLMModel', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'Grok1ForCausalLM', 'Grok1ModelForCausalLM', 'InternLM2ForCausalLM', 'LlamaForCausalLM', 'Phi3ForCausalLM', 'LlamaForClassification', 'LlamaEmbeddingModel', 'MistralModel', 'LlamaForSequenceClassification', 'LlamaForSequenceClassificationWithNormal_Weights', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MllamaForConditionalGeneration', 'OlmoForCausalLM', 'OlmoeForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'Qwen2VLForConditionalGeneration', 'StableLmForCausalLM', 'TorchNativeLlamaForCausalLM', 'TorchNativePhi3ForCausalLM', 'XverseForCausalLM', 'XverseMoeForCausalLM']

Reproduction

pip3 install "sglang[all]"
pip3 install "torch>=2.1.2" "transformers>=4.36" pillow
python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --port 30000

Environment

Python: 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0]
CUDA available: True
GPU 0: NVIDIA A100-PCIE-40GB
GPU 0 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda-12.4
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.107.02
PyTorch: 2.4.0+cu124
sglang: 0.3.5
flashinfer: 0.1.6+cu124torch2.4
triton: 3.0.0
transformers: 4.46.2
requests: 2.32.3
tqdm: 4.67.0
numpy: 1.26.4
aiohttp: 3.10.10
fastapi: 0.115.4
hf_transfer: 0.1.8
huggingface_hub: 0.26.2
interegular: 0.3.3
packaging: 24.2
PIL: 10.4.0
psutil: 6.1.0
pydantic: 2.9.2
uvicorn: 0.32.0
uvloop: 0.21.0
zmq: 26.2.0
vllm: 0.6.3.post1
multipart: 0.0.17
openai: 1.54.3
anthropic: 0.39.0
NVIDIA Topology: 
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-7     0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1048576
merrymercy commented 14 hours ago

The real error is:

[2024-11-10 22:07:20 TP0] Ignore import error when loading sglang.srt.models.llava. Failed to import transformers.models.clip.modeling_clip because of the following error (look up to see its traceback):
/root/anaconda3/envs/llava/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
[2024-11-10 22:07:20 TP0] Ignore import error when loading sglang.srt.models.llavavid. Failed to import transformers.models.clip.modeling_clip because of the following error (look up to see its traceback):
/root/anaconda3/envs/llava/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
[2024-11-10 22:07:20 TP0] Ignore import error when loading sglang.srt.models.yivl. Failed to import transformers.models.clip.modeling_clip because of the following error (look up to see its traceback):
/root/anaconda3/envs/llava/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE

It is unrelated to sglang. The undefined symbol is a PyTorch operator (at::_ops::zeros::call), which indicates your flash-attention build is not ABI-compatible with the PyTorch version you have installed. You should be able to reproduce this with transformers only.