[Open] mgoin opened this issue 7 months ago
Interesting. I would say this is not a bug. Rather, something creative needs to be figured out to make vLLM's assumption of exclusive GPU access compatible with sharing. One potential candidate is treating the block table as swappable/virtual space.
Hey, is this fixed? Does vLLM work on ZeroGPU now?
Your current environment
Information about HF ZeroGPU Spaces can be found here: https://huggingface.co/zero-gpu-explorers
The environment and code for this issue are kept fully within this Hugging Face Space, specifically the
app.py
containing the code expected to run a chat with vLLM: https://huggingface.co/spaces/mgoin/vllm-zero-gpu
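For context, a minimal sketch of the kind of app.py involved (the model name, sampling settings, and Gradio wiring below are illustrative assumptions, not copied from the Space):

```python
import gradio as gr
import spaces
from vllm import LLM, SamplingParams

# The engine is built at import time, i.e. in the Space's main process.
# On ZeroGPU this is the step that ends up touching CUDA too early.
llm = LLM(model="facebook/opt-125m")  # illustrative model choice
params = SamplingParams(temperature=0.8, max_tokens=256)

@spaces.GPU  # ZeroGPU attaches a GPU only while this function runs
def chat(message, history):
    outputs = llm.generate([message], params)
    return outputs[0].outputs[0].text

gr.ChatInterface(chat).launch()
```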
Interestingly, it seems like ZeroGPU Spaces really don't have GPUs available at startup. This is clearly an issue :)
🐛 Describe the bug
The ZeroGPU project claims that:
The benefit of working well with ZeroGPU is that you can now get access to free GPUs for live vLLM Spaces on HF, rather than paying an hourly price to host your vLLM demo. Currently they are using A100s, so there are definitely capable GPUs available. The complexity comes from the fact that ZeroGPU uses a sort of serverless or work-sharing structure, where the GPU is acquired and released around each application function call. It seems that vLLM breaks this contract with ZeroGPU because it directly assigns workers to devices using
torch.cuda.set_device(self.device)
during model load. Because vLLM carefully allocates and manages GPU memory, it may be fundamentally incompatible with what ZeroGPU requires in order to provide free GPUs for demos. Still, it's worth opening an issue, since it would be convenient if this were a small fix, and others may encounter it as the project ramps up.
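To illustrate the contract ZeroGPU expects, here is a minimal sketch using only the documented `spaces.GPU` decorator (the tensors and function are made up for illustration): CUDA may only be initialized inside a decorated function, never at import time in the main process, which is exactly when vLLM's workers call `torch.cuda.set_device`.

```python
import spaces
import torch

# Fine on a ZeroGPU Space: only CPU work happens at import time,
# so the main process never initializes CUDA.
weights = torch.randn(1024, 1024)

@spaces.GPU  # a GPU is attached only for the duration of this call
def run_on_gpu(x: torch.Tensor) -> torch.Tensor:
    # CUDA may be initialized here, inside the decorated function.
    w = weights.to("cuda")
    return (w @ x.to("cuda")).cpu()
```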
Here is the output of the HF Space when trying to load a model; you can clearly see the
CUDA must not be initialized in the main process on Spaces with Stateless GPU environment.
error: