sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

[Bug] Error when using LLAVA 1.5 for llava bench #2140

Closed pspdada closed 4 days ago

pspdada commented 5 days ago

Describe the bug

I want to follow https://github.com/sgl-project/sglang/blob/main/benchmark/llava_bench/README.md and run batch inference with LLaVA. First, I launch a LLaVA v1.5 7B model from a local path using: python3 -m sglang.launch_server --model-path /root/llm-project/utils/models/models-repo/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --port 30000 --disable-cuda-graph. Everything looks fine:

[2024-11-23 20:21:11] server_args=ServerArgs(model_path='/root/llm-project/utils/models/models-repo/llava-v1.5-7b', tokenizer_path='llava-hf/llava-1.5-7b-hf', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/root/llm-project/utils/models/models-repo/llava-v1.5-7b', chat_template=None, is_embedding=False, host='127.0.0.1', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=415803303, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
Some kwargs in processor config are unused and will not have any effect: num_additional_image_tokens. 
Some kwargs in processor config are unused and will not have any effect: num_additional_image_tokens. 
[2024-11-23 20:21:21 DP-1 TP0] Automatically turn off --chunked-prefill-size and adjust --mem-fraction-static for multimodal models.
[2024-11-23 20:21:21 DP-1 TP0] Init torch distributed begin.
[2024-11-23 20:21:21 DP-1 TP0] Load weight begin. avail mem=38.97 GB
[2024-11-23 20:21:22 DP-1 TP0] lm_eval is not installed, GPTQ may not be usable
Loading pt checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
/root/anaconda3/envs/sglang/lib/python3.10/site-packages/vllm/model_executor/model_loader/weight_utils.py:425: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state = torch.load(bin_file, map_location="cpu")
Loading pt checkpoint shards:  33% Completed | 1/3 [00:00<00:01,  1.43it/s]
Loading pt checkpoint shards:  67% Completed | 2/3 [00:29<00:16, 16.97s/it]
Loading pt checkpoint shards: 100% Completed | 3/3 [01:36<00:00, 39.92s/it]
Loading pt checkpoint shards: 100% Completed | 3/3 [01:36<00:00, 32.09s/it]

[2024-11-23 20:23:10 DP-1 TP0] Load weight end. type=LlavaLlamaForCausalLM, dtype=torch.float16, avail mem=25.72 GB
[2024-11-23 20:23:10 DP-1 TP0] Memory pool end. avail mem=6.28 GB
Some kwargs in processor config are unused and will not have any effect: num_additional_image_tokens. 
[2024-11-23 20:23:12 DP-1 TP0] max_total_num_tokens=39590, max_prefill_tokens=16384, max_running_requests=4097, context_len=4096
[2024-11-23 20:23:12] INFO:     Started server process [169184]
[2024-11-23 20:23:12] INFO:     Waiting for application startup.
[2024-11-23 20:23:12] INFO:     Application startup complete.
[2024-11-23 20:23:12] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2024-11-23 20:23:13] INFO:     127.0.0.1:44342 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-11-23 20:23:13 DP-1 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-23 20:23:15] INFO:     127.0.0.1:44354 - "POST /generate HTTP/1.1" 200 OK
[2024-11-23 20:23:15] The server is fired up and ready to roll!
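
For reference, the /generate endpoint shown in the log above can also be exercised directly with a plain HTTP request. This is only a minimal sketch: the payload fields follow SGLang's native /generate schema as I understand it, and the prompt template and image path are placeholders, not taken from the benchmark:

import requests

# Minimal sanity check against the running server (port 30000 as launched above).
# "text", "image_data" and "sampling_params" follow SGLang's native /generate API;
# example_image.png is a placeholder for any local image file.
response = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "<image>\nDescribe this image in one sentence.",
        "image_data": "example_image.png",
        "sampling_params": {"max_new_tokens": 32, "temperature": 0},
    },
)
print(response.json())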

Then I run llava_bench with python3 bench_sglang.py --num-questions 60, and an error occurs. On the server side:

[2024-11-23 20:20:03] INFO:     127.0.0.1:52700 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-11-23 20:20:03 DP-1 TP0] Prefill batch. #new-seq: 1, #new-token: 33, #cached-token: 1, cache hit rate: 2.44%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-23 20:20:03] INFO:     127.0.0.1:52704 - "POST /generate HTTP/1.1" 200 OK
[2024-11-23 20:20:06 DP-1 TP0] Traceback (most recent call last):
  File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1407, in run_scheduler_process
    scheduler.event_loop_overlap()
  File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 388, in event_loop_overlap
    self.process_input_requests(recv_reqs)
  File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 493, in process_input_requests
    self.handle_generate_request(recv_req)
  File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 556, in handle_generate_request
    req.origin_input_ids = self.pad_input_ids_func(
  File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/models/llava.py", line 67, in pad_input_ids
    num_patch_width, num_patch_height = get_anyres_image_grid_shape(
  File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/mm_utils.py", line 173, in get_anyres_image_grid_shape
    possible_resolutions = ast.literal_eval(grid_pinpoints)
  File "/root/anaconda3/envs/sglang/lib/python3.10/ast.py", line 110, in literal_eval
    return _convert(node_or_string)
  File "/root/anaconda3/envs/sglang/lib/python3.10/ast.py", line 109, in _convert
    return _convert_signed_num(node)
  File "/root/anaconda3/envs/sglang/lib/python3.10/ast.py", line 83, in _convert_signed_num
    return _convert_num(node)
  File "/root/anaconda3/envs/sglang/lib/python3.10/ast.py", line 74, in _convert_num
    _raise_malformed_node(node)
  File "/root/anaconda3/envs/sglang/lib/python3.10/ast.py", line 71, in _raise_malformed_node
    raise ValueError(msg + f': {node!r}')
ValueError: malformed node or string: None

zsh: killed     python3 -m sglang.launch_server --model-path  --tokenizer-path  --port 30000

On the client side (running bench_sglang.py):

uncher 58987 -- /root/llm-project/sglang/benchmark/llava_bench/bench_sglang.py 
  0%|                                                                                                                                                                                                                    | 0/60 [00:00<?, ?it/s]/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/lang/interpreter.py:339: UserWarning: Error in stream_executor: Traceback (most recent call last):
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/lang/interpreter.py", line 337, in _thread_worker_func
    self._execute(expr)
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/lang/interpreter.py", line 380, in _execute
    self._execute(x)
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/lang/interpreter.py", line 375, in _execute
    self._execute_gen(other)
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/lang/interpreter.py", line 502, in _execute_gen
    comp, meta_info = self.backend.generate(
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/lang/backend/runtime_endpoint.py", line 163, in generate
    res = http_request(
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/utils.py", line 100, in http_request
    resp = urllib.request.urlopen(req, data=data, cafile=verify)
  File "/root/anaconda3/envs/llava/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/root/anaconda3/envs/llava/lib/python3.10/urllib/request.py", line 519, in open
    response = self._open(req, data)
  File "/root/anaconda3/envs/llava/lib/python3.10/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/root/anaconda3/envs/llava/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/root/anaconda3/envs/llava/lib/python3.10/urllib/request.py", line 1377, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/root/anaconda3/envs/llava/lib/python3.10/urllib/request.py", line 1352, in do_open
    r = h.getresponse()
  File "/root/anaconda3/envs/llava/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/root/anaconda3/envs/llava/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/root/anaconda3/envs/llava/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

  warnings.warn(f"Error in stream_executor: {get_exception_traceback()}")
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:03<00:00, 17.06it/s]
Latency: 3.821
Write output to answers.jsonl
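
From the two tracebacks, the client-side RemoteDisconnected is just a symptom of the scheduler process dying; the actual failure is in get_anyres_image_grid_shape, where ast.literal_eval is called on a grid_pinpoints value that is None. Presumably the LLaVA 1.5 config defines no image_grid_pinpoints (anyres grids only appeared with LLaVA 1.6), so the anyres code path receives None. A minimal sketch of just that failure mode, using a hypothetical stand-in for the model config:

import ast

# Hypothetical stand-in for a LLaVA 1.5 HF config: it predates "anyres",
# so there is no image_grid_pinpoints entry.
config = {"image_aspect_ratio": "pad"}

grid_pinpoints = config.get("image_grid_pinpoints")  # -> None
ast.literal_eval(grid_pinpoints)
# ValueError: malformed node or string: None  (same error as in the server log)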

Reproduction

Follow https://github.com/sgl-project/sglang/blob/main/benchmark/llava_bench/README.md

Environment

Python: 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 15:12:24) [GCC 11.2.0]
CUDA available: True
GPU 0,1: NVIDIA A100-PCIE-40GB
GPU 0,1 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda-12.4
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.107.02
PyTorch: 2.4.0+cu121
sglang: 0.3.6
flashinfer: 0.1.6+cu124torch2.4
triton: 3.0.0
transformers: 4.46.3
torchao: 0.6.1
numpy: 1.26.4
aiohttp: 3.9.5
fastapi: 0.115.4
hf_transfer: 0.1.8
huggingface_hub: 0.26.2
interegular: 0.3.3
psutil: 5.9.0
pydantic: 2.9.2
multipart: 0.0.17
zmq: 25.1.2
uvicorn: 0.32.0
uvloop: 0.21.0
vllm: 0.6.3.post1
openai: 1.54.3
anthropic: 0.39.0
NVIDIA Topology: 
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     0-7     0               N/A
GPU1    PHB      X      0-7     0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1048576
I1123 14:38:57.196000 139823990548288 torch/_dynamo/utils.py:335] TorchDynamo compilation metrics:
I1123 14:38:57.196000 139823990548288 torch/_dynamo/utils.py:335] Function    Runtimes (s)
I1123 14:38:57.196000 139823990548288 torch/_dynamo/utils.py:335] ----------  --------------
V1123 14:38:57.196000 139823990548288 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats constrain_symbol_range: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
V1123 14:38:57.197000 139823990548288 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats evaluate_expr: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
V1123 14:38:57.197000 139823990548288 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _simplify_floor_div: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
V1123 14:38:57.197000 139823990548288 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _maybe_guard_rel: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
V1123 14:38:57.197000 139823990548288 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _find: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
V1123 14:38:57.197000 139823990548288 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats has_hint: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
V1123 14:38:57.197000 139823990548288 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats size_hint: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
V1123 14:38:57.197000 139823990548288 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats simplify: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
V1123 14:38:57.197000 139823990548288 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _update_divisible: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
V1123 14:38:57.197000 139823990548288 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats replace: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
V1123 14:38:57.197000 139823990548288 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _maybe_evaluate_static: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
V1123 14:38:57.197000 139823990548288 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats get_implications: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
V1123 14:38:57.197000 139823990548288 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats get_axioms: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
V1123 14:38:57.198000 139823990548288 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats safe_expand: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
V1123 14:38:57.198000 139823990548288 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats uninteresting_files: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
merrymercy commented 4 days ago

LLaVA 1.5 is not maintained anymore. Please try the newer LLaVA v1.6 / LLaVA-NeXT / LLaVA-OneVision instead:
https://github.com/sgl-project/sglang/blob/main/examples/frontend_language/quick_start/local_example_llava_next.py
https://github.com/sgl-project/sglang/tree/main/examples/runtime/llava_onevision
https://github.com/sgl-project/sglang/blob/731146f6cbec40f502e16dc971a150ed46b207ad/test/srt/test_vision_openai_server.py#L31
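
For reference, a minimal sketch along the lines of the linked local_example_llava_next.py (the model/tokenizer paths and the image file are placeholders; substitute the LLaVA-NeXT or OneVision checkpoint you actually use):

import sglang as sgl


@sgl.function
def image_qa(s, image_path, question):
    # One user turn containing an image plus a question, then generate the answer.
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))


if __name__ == "__main__":
    # Placeholder checkpoint; any LLaVA-NeXT / OneVision model supported by SGLang works here.
    runtime = sgl.Runtime(
        model_path="lmms-lab/llama3-llava-next-8b",
        tokenizer_path="lmms-lab/llama3-llava-next-8b-tokenizer",
    )
    sgl.set_default_backend(runtime)

    state = image_qa.run(image_path="example_image.png", question="What is in this image?")
    print(state["answer"])

    runtime.shutdown()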