runpod-workers / worker-vllm

The RunPod worker template for serving our large language model endpoints. Powered by vLLM.

Do the new images work? #51

Closed: dannysemi closed this issue 6 months ago

dannysemi commented 6 months ago

I based my custom worker on the vllm-base image base-0.3.0-cuda12.1.0, but if I try to run it with multiple GPUs I get this error:

"ImportError: NCCLBackend is not available. Please install cupy."

It works fine if I'm only using one GPU. I saw this comment in the sls-worker repo's Dockerfile:

# We used base cuda image because pytorch installs its own cuda libraries.
# However cupy depends on cuda libraries so we had to switch to the runtime image
# In the future it would be nice to get a container with pytorch and cuda without duplicating cuda
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04 AS vllm-base

Was this Dockerfile used to create the base image?

I can see cupy listed in the requirements.txt too.

alpayariyak commented 6 months ago

Hey, could you try adding the installation of cupy with the specified version into the worker Dockerfile?
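
For reference, a quick check (illustrative only, using standard cupy APIs, not something from this thread) of whether cupy and its NCCL bindings can actually reach the CUDA runtime inside the container, since that is what NCCLBackend needs:

import cupy
from cupy.cuda import nccl  # vLLM's NCCLBackend relies on these bindings

# These calls need the CUDA runtime libraries at load time; on an image that
# doesn't ship them (e.g. a bare "base" CUDA image) this is roughly where
# things fall over.
print("cupy version:", cupy.__version__)
print("CUDA runtime version:", cupy.cuda.runtime.runtimeGetVersion())
print("NCCL version:", nccl.get_version())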

dannysemi commented 6 months ago

> Hey, could you try adding the installation of cupy with the specified version into the worker Dockerfile?

Oh that makes sense. I'll try that now.

dannysemi commented 6 months ago

I got the same error. I must be doing something wrong. Does this work in your testing?

alpayariyak commented 6 months ago

I haven't tested multi-GPU yet; I'll be working on fixing this later today.

alpayariyak commented 6 months ago

It seems like the quickest way to fix this, if we follow vLLM's approach, is to use the runtime image instead of the base image, like they did in the snippet you shared. The issue is that it would make the container 1.5 GB heavier.

dannysemi commented 6 months ago

Not ideal, but it looks like they didn't have a better solution either.

Also, it looks like they added LoRA support: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/serving_chat.py#L25

I'm just ignoring it for now with this change: https://github.com/dannysemi/worker-vllm/commit/d90b012a5f7ad78c483ccad7718c4d54c5931d3f#diff-865c0134cc71b80102811a3e1216d5dad097594d0a4cabcb4dfc077d925af689R200

I didn't see a change for that here.

It's weird because it's marked as Optional, but it doesn't run unless I explicitly pass None.

alpayariyak commented 6 months ago

It won't run without that change because the arguments are passed positionally, so the chat template ends up being passed as the lora_modules argument. The new base images are meant for 0.3.0, where that change is already included.
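
As a simplified illustration (hypothetical function, not vLLM's exact signature): when a new parameter is inserted before an existing one, old positional call sites silently bind the wrong value.

# Hypothetical stand-in for the real constructor, just to show the binding.
def serving_chat(engine, served_model, lora_modules=None, chat_template=None):
    return {"lora_modules": lora_modules, "chat_template": chat_template}

# Call site written before lora_modules existed:
print(serving_chat("engine", "my-model", "my-template"))
# -> the chat template silently lands in lora_modules

# Workaround as in the linked commit: pass None for lora_modules explicitly,
# or better, use a keyword argument.
print(serving_chat("engine", "my-model", None, "my-template"))
print(serving_chat("engine", "my-model", chat_template="my-template"))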

If you're interested, the openai-sse-output branch (soon to be released as 0.3.0) has OpenAI compatibility, support for Gemma, general improvements, and new features; the docs are up-to-date to get you started. You'll have to enforce eager there too until I fix the multi-GPU issue, but it's available as an env variable.
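
A rough sketch of what that env toggle amounts to on the engine side (the variable names MODEL_NAME, TENSOR_PARALLEL_SIZE, and ENFORCE_EAGER are illustrative assumptions, not taken from the worker code):

import os
from vllm.engine.arg_utils import AsyncEngineArgs

# enforce_eager disables CUDA graph capture, which is the multi-GPU workaround
# mentioned above; the env variable names here are placeholders.
engine_args = AsyncEngineArgs(
    model=os.environ["MODEL_NAME"],
    tensor_parallel_size=int(os.environ.get("TENSOR_PARALLEL_SIZE", "1")),
    enforce_eager=os.environ.get("ENFORCE_EAGER", "0") == "1",
)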

dannysemi commented 6 months ago

Awesome! Thanks

dannysemi commented 6 months ago

> If you're interested, the openai-sse-output branch (soon to be released as 0.3.0) has OpenAI compatibility, support for Gemma, general improvements, and new features; the docs are up-to-date to get you started. You'll have to enforce eager there too until I fix the multi-GPU issue, but it's available as an env variable.

I just tried out the new OpenAI compatibility and it works great. Good job!
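
In case it helps anyone else, a minimal client call against the OpenAI-compatible route looks roughly like this (the base URL pattern, env variable names, and model name below are my assumptions; check the docs for the exact values):

import os
from openai import OpenAI

# Assumed RunPod serverless URL layout; substitute your own endpoint ID and key.
client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url=f"https://api.runpod.ai/v2/{os.environ['RUNPOD_ENDPOINT_ID']}/openai/v1",
)

stream = client.chat.completions.create(
    model="your-served-model-name",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,  # streamed over SSE, per the openai-sse-output branch name
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)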

alpayariyak commented 6 months ago

Thank you, great to hear that! :)

Let me know if you run into any issues or have feature requests

alpayariyak commented 6 months ago

Fixed multi-GPU in the 0.3.0 release :)

dannysemi commented 6 months ago

Thanks again. If I'm already using this image in a region, will it pull the fixed image without the version being bumped?

alpayariyak commented 6 months ago

Only if the workers are deleted or restarted. We check whether there are any updates to an image when initializing workers.