vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Disable log requests and disable log stats do not work #6129

Open wufxgtihub123 opened 4 days ago

wufxgtihub123 commented 4 days ago

Your current environment

Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 10.1.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.17

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-862.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 3090
GPU 3: NVIDIA GeForce RTX 3090

Nvidia driver version: 535.54.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 49
Model name:            AMD EPYC 7302 16-Core Processor
Stepping:              0
CPU MHz:               1500.000
CPU max MHz:           3000.0000
CPU min MHz:           1500.0000
BogoMIPS:              5989.04
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              16384K
NUMA node0 CPU(s):     0-15,32-47
NUMA node1 CPU(s):     16-31,48-63
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb cat_l3 cdp_l3 hw_pstate sme retpoline_amd vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr ibpb ibrs stibp arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.2
[pip3] nvidia-nccl-cu12==2.18.1
[pip3] sentence-transformers==2.2.2
[pip3] torch==2.1.2
[pip3] torchvision==0.16.2
[pip3] transformers==4.37.0
[pip3] transformers-stream-generator==0.0.4
[pip3] triton==2.1.0
[conda] numpy                     1.26.2                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.18.1                   pypi_0    pypi
[conda] sentence-transformers     2.2.2                    pypi_0    pypi
[conda] torch                     2.1.2                    pypi_0    pypi
[conda] torchvision               0.16.2                   pypi_0    pypi
[conda] transformers              4.37.0                   pypi_0    pypi
[conda] transformers-stream-generator 0.0.4                    pypi_0    pypi
[conda] triton                    2.1.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.3.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  PIX SYS SYS 0-15,32-47  0       N/A
GPU1    PIX  X  SYS SYS 0-15,32-47  0       N/A
GPU2    SYS SYS  X  PIX 16-31,48-63 1       N/A
GPU3    SYS SYS PIX  X  16-31,48-63 1       N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

When I start the vLLM OpenAI-compatible inference server from the command line:

```bash
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
    --model /date/pretrained_models/Qwen1.5-14B-Chat \
    --trust-remote-code \
    --served-model-name qwen7b \
    --api-key sk-abcd \
    --port 8005 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 6832 \
    --tensor-parallel-size 2 \
    --disable-log-requests \
    --disable-log-stats
```

it continuously generates log files under /tmp/ray/, which take up a lot of disk space. How can I adjust this command line so that vLLM does not generate any log files at all? I really don't want to record any log files.

hmellor commented 3 days ago

--disable-log-stats and --disable-log-requests do not disable all logging; they disable the logging of engine stats and of request contents, respectively. They work as intended.
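For reference, a minimal sketch of what these two flags control when the engine is constructed programmatically (assuming the Python engine API of this vLLM era, where the CLI flags map onto `AsyncEngineArgs` fields of the same names):

```python
# Sketch: the two CLI flags correspond to AsyncEngineArgs fields.
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="/date/pretrained_models/Qwen1.5-14B-Chat",
    trust_remote_code=True,
    tensor_parallel_size=2,
    disable_log_requests=True,  # suppress per-request prompt/output logging
    disable_log_stats=True,     # suppress the periodic throughput/usage stats
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```

Neither field touches Ray's own log files, which is what is filling the disk here.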

/tmp/ray is written to by Ray. That logging is disabled by passing log_to_driver=False to ray.init(), which would be done here:

https://github.com/vllm-project/vllm/blob/56b325e977435af744f8b3dca7af0ca209663558/vllm/executor/ray_utils.py#L83-L88
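If you would rather not patch vLLM, one possible workaround (a sketch, not a tested fix: it assumes vLLM's own ray.init() call becomes a no-op when a Ray session already exists in the process, since it passes ignore_reinit_error=True) is to initialize Ray yourself with logging turned down before starting the engine:

```python
# Workaround sketch: pre-initialize Ray with driver log forwarding off.
# log_to_driver and logging_level are standard ray.init() parameters.
import logging
import ray

ray.init(
    log_to_driver=False,           # don't stream worker logs to the driver
    logging_level=logging.ERROR,   # quiet Ray's own loggers
)

# ...then construct the vLLM engine in this same process (e.g. via
# AsyncLLMEngine.from_engine_args as sketched above), so vLLM reuses
# this already-initialized Ray session.
```

This only reduces what Ray logs; pruning old /tmp/ray/session_* directories is still needed to reclaim space already used.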