vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Exception in thread Thread-3 (_report_usage_worker) when running python3 vllm/benchmarks/benchmark_throughput.py on the vLLM OpenVINO backend #6340

Closed: HPUedCSLearner closed this issue 2 months ago

HPUedCSLearner commented 2 months ago

Your current environment

The output of `python collect_env.py`

Collecting environment information...
WARNING 07-11 22:54:46 _custom_ops.py:14] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
PyTorch version: 2.3.0+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.35

Python version: 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.5.0-41-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             20
On-line CPU(s) list:                0-19
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
CPU family:                         6
Model:                              63
Thread(s) per core:                 2
Core(s) per socket:                 10
Socket(s):                          1
Stepping:                           2
CPU max MHz:                        3500.0000
CPU min MHz:                        1200.0000
BogoMIPS:                           5786.88
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts vnmi md_clear flush_l1d
Virtualization:                     VT-x
L1d cache:                          320 KiB (10 instances)
L1i cache:                          320 KiB (10 instances)
L2 cache:                           2.5 MiB (10 instances)
L3 cache:                           25 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-19
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        KVM: Mitigation: VMX disabled
Vulnerability L1tf:                 Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] onnx==1.16.1
[pip3] torch==2.3.0+cpu
[pip3] transformers==4.42.3
[pip3] triton==3.0.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] torch                     2.3.0+cpu                pypi_0    pypi
[conda] transformers              4.42.3                   pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

🐛 Describe the bug

Starting from: https://docs.vllm.ai/en/latest/getting_started/openvino-installation.html

Then, running the following command, I get an exception.

VLLM_OPENVINO_KVCACHE_SPACE=10  \
VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8  \
VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
     python3 vllm/benchmarks/benchmark_throughput.py \
        --input-len 128 --output-len 128 \
        --num_prompts 100

The log:

(vllm-openvino) feng@feng-X99M-D3:~/llm/vllm$ VLLM_OPENVINO_KVCACHE_SPACE=10  VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON     python3 vllm/benchmarks/benchmark_throughput.py --input-len 128 --output-len 128 --num_prompts 100

WARNING 07-11 22:58:38 _custom_ops.py:14] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='facebook/opt-125m', tokenizer='facebook/opt-125m', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', gpu_memory_utilization=0.9, enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, device='auto', enable_prefix_caching=False, enable_chunked_prefill=False, max_num_batched_tokens=None, download_dir=None, output_json=None, distributed_executor_backend=None, load_format='auto')

INFO 07-11 22:58:39 llm_engine.py:174] Initializing an LLM engine (v0.5.1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=facebook/opt-125m, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 07-11 22:58:40 openvino_executor.py:132] Only float32 dtype is supported on OpenVINO, casting from torch.float16.
WARNING 07-11 22:58:40 openvino_executor.py:137] CUDA graph is not supported on OpenVINO backend, fallback to the eager mode.
INFO 07-11 22:58:40 openvino_executor.py:146] KV cache type is overried to u8 via VLLM_OPENVINO_CPU_KV_CACHE_PRECISION env var.
INFO 07-11 22:58:40 openvino_executor.py:159] OpenVINO optimal block size is 32, overriding currently set 16
INFO 07-11 22:58:43 selector.py:124] Cannot use _Backend.FLASH_ATTN backend on OpenVINO.
INFO 07-11 22:58:43 selector.py:69] Using OpenVINO Attention backend.
DEBUG 07-11 22:58:43 parallel_state.py:803] world_size=1 rank=0 local_rank=-1 distributed_init_method=tcp://192.168.2.23:40293 backend=gloo
WARNING 07-11 22:58:44 openvino.py:123] Provided model id facebook/opt-125m does not contain OpenVINO IR, the model will be converted to IR with default options. If you need to use specific options for model conversion, use optimum-cli export openvino with desired options.
Framework not specified. Using pt to export the model.

Using framework PyTorch: 2.3.0+cpu
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
Overriding 1 configuration item(s)
    - use_cache -> True
/home/feng/miniconda3/envs/vllm-openvino/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py:824: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  elif attention_mask.shape[1] != mask_seq_length:
/home/feng/miniconda3/envs/vllm-openvino/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:114: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if (input_shape[-1] > 1 or self.sliding_window is not None) and self.is_causal:
/home/feng/miniconda3/envs/vllm-openvino/lib/python3.10/site-packages/optimum/exporters/onnx/model_patcher.py:303: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if past_key_values_length > 0:
/home/feng/miniconda3/envs/vllm-openvino/lib/python3.10/site-packages/optimum/bettertransformer/models/attention.py:285: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if batch_size == 1 or self.training:
/home/feng/miniconda3/envs/vllm-openvino/lib/python3.10/site-packages/optimum/bettertransformer/models/attention.py:299: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_output.size() != (batch_size, self.num_heads, tgt_len, self.head_dim):
['input_ids', 'attention_mask', 'past_key_values']
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│   Num bits (N) │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│              8 │ 100% (74 / 74)              │ 100% (74 / 74)                         │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 74/74 • 0:00:00 • 0:00:00
INFO 07-11 22:59:01 openvino_executor.py:72] # CPU blocks: 16181
INFO 07-11 22:59:01 selector.py:124] Cannot use _Backend.FLASH_ATTN backend on OpenVINO.
INFO 07-11 22:59:01 selector.py:69] Using OpenVINO Attention backend.

Processed prompts:   0%|                                                                                                                      | 0/100 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]Exception in thread Thread-3 (_report_usage_worker):
Traceback (most recent call last):
  File "/home/feng/miniconda3/envs/vllm-openvino/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/feng/miniconda3/envs/vllm-openvino/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng/miniconda3/envs/vllm-openvino/lib/python3.10/site-packages/vllm/usage/usage_lib.py", line 140, in _report_usage_worker
    self._report_usage_once(model_architecture, usage_context, extra_kvs)
  File "/home/feng/miniconda3/envs/vllm-openvino/lib/python3.10/site-packages/vllm/usage/usage_lib.py", line 179, in _report_usage_once
    self._write_to_file(data)
  File "/home/feng/miniconda3/envs/vllm-openvino/lib/python3.10/site-packages/vllm/usage/usage_lib.py", line 206, in _write_to_file
    json.dump(data, f)
  File "/home/feng/miniconda3/envs/vllm-openvino/lib/python3.10/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/home/feng/miniconda3/envs/vllm-openvino/lib/python3.10/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/home/feng/miniconda3/envs/vllm-openvino/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/home/feng/miniconda3/envs/vllm-openvino/lib/python3.10/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/home/feng/miniconda3/envs/vllm-openvino/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Type is not JSON serializable

Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:19<00:00,  5.09it/s, est. speed input: 656.58 toks/s, output: 651.49 toks/s]
Throughput: 5.08 requests/s, 1300.91 tokens/s
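
For reference, the failure can be reproduced in isolation. Judging from the traceback, the usage data contains an object whose class is named Type (presumably the OpenVINO Type carrying the u8 KV cache precision; that is a guess, since the offending value itself is not printed in the log). Any plain class instance produces the same error:

import json

# Hypothetical stand-in for the non-serializable value; the real object is
# presumably an OpenVINO Type instance (e.g. the u8 KV cache precision).
class Type:
    pass

try:
    json.dumps({"kv_cache_dtype": Type()})
except TypeError as e:
    print(e)  # Object of type Type is not JSON serializable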
mgoin commented 2 months ago

@ilya-lavrenov @helena-intel can you look into this?

helena-intel commented 2 months ago

@HPUedCSLearner thanks for the clear report! I see the same issue with the other benchmarking scripts. The problem is in writing the usage_stats.json file; the benchmarking itself should still work fine. We will create a PR to fix this. In the meantime, to prevent the exception, you can work around it by editing https://github.com/vllm-project/vllm/blob/main/vllm/usage/usage_lib.py to remove the non-JSON-serializable object from the data:

--- a/vllm/usage/usage_lib.py
+++ b/vllm/usage/usage_lib.py
@@ -200,6 +200,7 @@ class UsageMessage:
             logging.debug("Failed to send usage data to server")

     def _write_to_file(self, data):
+        data.pop("kv_cache_dtype")
         os.makedirs(os.path.dirname(_USAGE_STATS_JSON_PATH), exist_ok=True)
         Path(_USAGE_STATS_JSON_PATH).touch(exist_ok=True)
         with open(_USAGE_STATS_JSON_PATH, "a") as f:
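
A one-line alternative (a suggestion, not the official fix) would be to keep the field and let json stringify anything it cannot encode, via the default= hook of json.dump:

--- a/vllm/usage/usage_lib.py
+++ b/vllm/usage/usage_lib.py
@@ -203,7 +203,8 @@ class UsageMessage:
         os.makedirs(os.path.dirname(_USAGE_STATS_JSON_PATH), exist_ok=True)
         Path(_USAGE_STATS_JSON_PATH).touch(exist_ok=True)
         with open(_USAGE_STATS_JSON_PATH, "a") as f:
-            json.dump(data, f)
+            # default=str stringifies non-serializable values such as the OpenVINO Type
+            json.dump(data, f, default=str)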