vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: vLLM does not work with Hugging Face ZeroGPU Spaces #3510

Open mgoin opened 3 months ago

mgoin commented 3 months ago

Information about HF ZeroGPU Spaces can be found here: https://huggingface.co/zero-gpu-explorers

The environment and code for this issue are kept entirely within this Hugging Face Space. Its app.py contains the code that is expected to work for running a chat with vLLM: https://huggingface.co/spaces/mgoin/vllm-zero-gpu

Your current environment

Interestingly, it seems like ZeroGPU Spaces really don't have GPUs available at startup. This is clearly an issue :)

Collecting environment information...
/usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py:611: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 12 (bookworm) (x86_64)
GCC version: (Debian 12.2.0-14) 12.2.0
Clang version: Could not collect
CMake version: version 3.25.1
Libc version: glibc-2.36

Python version: 3.10.13 (main, Mar 12 2024, 12:16:25) [GCC 12.2.0] (64-bit runtime)
Python platform: Linux-5.10.192-183.736.amzn2.x86_64-x86_64-with-glibc2.36
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             96
On-line CPU(s) list:                0-95
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
CPU family:                         6
Model:                              85
Thread(s) per core:                 2
Core(s) per socket:                 24
Socket(s):                          2
Stepping:                           7
BogoMIPS:                           5999.98
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          1.5 MiB (48 instances)
L1i cache:                          1.5 MiB (48 instances)
L2 cache:                           48 MiB (48 instances)
L3 cache:                           71.5 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-23,48-71
NUMA node1 CPU(s):                  24-47,72-95
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:        KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                 Mitigation; PTE Inversion
Vulnerability Mds:                  Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Vulnerable
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, STIBP disabled, RSB filling
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.3.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

🐛 Describe the bug

The ZeroGPU project claims that:

ZeroGPU Spaces should mostly be compatible with any PyTorch-based GPU Space. Compatibility with high level HF libraries like transformers or diffusers is slightly more guaranteed. That said, ZeroGPU Spaces are not as broadly compatible as classical GPU Spaces and you might still encounter unexpected bugs.

The benefit of working well with ZeroGPU is that you can get access to free GPUs for live vLLM Spaces on HF, rather than paying an hourly price to host your vLLM demo. They are currently using A100s, so the available GPUs are definitely capable. The complexity comes from the fact that ZeroGPU uses a sort of serverless or work-sharing structure, where the GPU is quickly taken and released around each application function call. It seems that vLLM breaks this contract with ZeroGPU because it directly assigns workers to devices using torch.cuda.set_device(self.device) during model load, which initializes CUDA in the main process.
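
To make the acquire-and-release contract concrete, here is a minimal sketch of deferring engine construction until the first call inside a GPU-scoped function. Everything in it is a stand-in: `gpu_scope` is a hypothetical placeholder for a ZeroGPU-style decorator, and `fake_engine_factory` is a placeholder for building a vLLM engine. It has not been tested on ZeroGPU, and it is not established that vLLM works when constructed this way, since a cached engine may not survive if each call runs in a separate process.

```python
import functools


def gpu_scope(fn):
    """Hypothetical stand-in for a ZeroGPU-style decorator.

    A real decorator would acquire a GPU before the call and release it
    afterwards; this placeholder simply calls the function.
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


class LazyEngine:
    """Defers engine construction until the first call to get()."""

    def __init__(self, factory):
        self._factory = factory
        self._engine = None
        self.init_count = 0  # how many times the factory has run

    def get(self):
        if self._engine is None:
            self._engine = self._factory()
            self.init_count += 1
        return self._engine


def fake_engine_factory():
    # Placeholder (assumption): a real app would construct its engine here,
    # which is the step that touches CUDA.
    return lambda prompt: f"echo: {prompt}"


# Nothing GPU-related happens at import time; only the factory is stored.
lazy_engine = LazyEngine(fake_engine_factory)


@gpu_scope
def generate(prompt):
    # The engine is built on first use, inside the GPU-scoped call.
    engine = lazy_engine.get()
    return engine(prompt)
```

The point of the pattern is only that no device is touched at module import, which is where the traceback in this issue originates.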

Because vLLM carefully allocates and manages GPU memory, it may be fundamentally incompatible with what ZeroGPU requires in order to provide GPUs for free for demos. Still, it's worth opening an issue, since it would be convenient if this turned out to be a small fix, and others may encounter it as the project ramps up.

Here is the output of the HF Space when trying to load a model. You can clearly see the "CUDA must not be initialized in the main process on Spaces with Stateless GPU environment." error:

INFO 03-19 21:22:02 llm_engine.py:87] Initializing an LLM engine with config: model='NousResearch/Hermes-2-Pro-Mistral-7B', tokenizer='NousResearch/Hermes-2-Pro-Mistral-7B', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/home/user/app/app.py", line 34, in <module>
    model = LLM(model_id, max_model_len=MAX_INPUT_TOKEN_LENGTH)
  File "/usr/local/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 109, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 391, in from_engine_args
    engine = cls(*engine_configs,
  File "/usr/local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 128, in __init__
    self._init_workers()
  File "/usr/local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 180, in _init_workers
    self._run_workers("init_model")
  File "/usr/local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1041, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/usr/local/lib/python3.10/site-packages/vllm/worker/worker.py", line 85, in init_model
    torch.cuda.set_device(self.device)
  File "/usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
  File "/usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    torch._C._cuda_init()
  File "/usr/local/lib/python3.10/site-packages/spaces/zero/torch.py", line 133, in _cuda_init_raise
    raise RuntimeError(
RuntimeError: CUDA must not be initialized in the main process on Spaces with Stateless GPU environment.
You can look at this Stacktrace to find out which part of your code triggered a CUDA init

simon-mo commented 3 months ago

Interesting. I would say this is not a bug. Rather, something creative needs to be figured out to make vLLM's assumption of exclusive GPU access compatible with sharing. One potential candidate is treating the block table as swappable/virtual space.
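
As a rough illustration of what a swappable block table could mean, the toy sketch below maps logical KV-cache blocks to either a GPU slot or a CPU slot, and moves everything off the GPU before it is released. The class and method names are hypothetical and are not vLLM's actual block manager API. It tracks bookkeeping only and does not copy any tensor data.

```python
class SwappableBlockTable:
    """Toy table: logical block id -> (device, physical slot)."""

    def __init__(self, num_gpu_blocks):
        self.free_gpu = list(range(num_gpu_blocks))  # unused GPU slots
        self.table = {}                              # logical id -> (device, slot)
        self._next_cpu_slot = 0                      # CPU slots are never reused here

    def allocate(self, logical_id):
        """Place a new logical block in a free GPU slot."""
        if not self.free_gpu:
            raise MemoryError("no free GPU blocks")
        self.table[logical_id] = ("gpu", self.free_gpu.pop())

    def swap_out_all(self):
        """Remap every GPU-resident block to CPU so the GPU can be released."""
        for logical_id, (device, slot) in list(self.table.items()):
            if device == "gpu":
                self.free_gpu.append(slot)
                self.table[logical_id] = ("cpu", self._next_cpu_slot)
                self._next_cpu_slot += 1

    def swap_in(self, logical_id):
        """Remap a CPU-resident block to a GPU slot once a GPU is held again."""
        device, _ = self.table[logical_id]
        if device == "cpu":
            if not self.free_gpu:
                raise MemoryError("no free GPU blocks")
            self.table[logical_id] = ("gpu", self.free_gpu.pop())
```

In this picture, `swap_out_all` would run before the GPU is handed back and `swap_in` after it is reacquired. Whether that is fast enough for per-call GPU sharing is an open question.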