runpod-workers / worker-vllm

The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
MIT License

Errors cause the instance to run indefinitely #29

Open gabewillen opened 10 months ago

gabewillen commented 10 months ago

Any error caused by the payload makes the instance hang in an error state indefinitely. You have to terminate the instance manually, or you'll rack up a hefty bill if several workers are stuck in that state.

alpayariyak commented 9 months ago

Are you still facing this issue?

dannysemi commented 9 months ago

I had this issue yesterday. Used up all of my credits overnight.

bartlettD commented 9 months ago

I've seen this as well, but more from the perspective that if vLLM runs into an error, the worker keeps retrying the job over and over.

I can get this to happen if I do the following (one way to cap the context length is sketched after the list):

  1. Try to load a model with a context size larger than will fit in memory.
  2. Send a request.
  3. The container logs show vLLM quitting with an out-of-memory error.
  4. vLLM restarts the job, which fails again.
  5. Repeat.
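
For what it's worth, here is a minimal sketch of avoiding the OOM in step 1 at the vLLM level. max_model_len and gpu_memory_utilization are standard vLLM engine arguments; the model name below is only an example. worker-vllm drives the engine through environment variables, so check its README for the corresponding variable names rather than treating this as the worker's own configuration.

    # Hedged sketch, not worker code: cap the context so the KV cache fits in GPU memory.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="mistralai/Mistral-7B-Instruct-v0.2",  # example model, not from this thread
        max_model_len=4096,           # cap the context instead of using the model's full window
        gpu_memory_utilization=0.90,  # leave headroom so the engine does not OOM at startup
    )

    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    print(outputs[0].outputs[0].text)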

ashleykleynhans commented 9 months ago

This is not a vLLM-specific thing; it happens when my other workers get errors too. They just keep running over and over and spawning more and more workers until you scale your workers down to zero. This seems to be some kind of issue with the backend or the RunPod SDK.

gabewillen commented 9 months ago

This is why we abandoned the serverless VLLM worker. We are now using a custom TGI serverless worker that hasn't experienced this issue.

dannysemi commented 9 months ago

I'm going to try polling the health check for retries and cancel the job if I get more than one or two retries.
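
For anyone wanting to do something similar, here is a rough sketch of that kind of watchdog against the public RunPod serverless HTTP API (the /status/{job_id} and /cancel/{job_id} routes). I'm not certain the status payload exposes a per-job retry count, so this version simply cancels the job after a time budget; the endpoint ID, API key, and timeout are placeholders.

    # Hedged sketch: external watchdog that cancels a job that stays unfinished too long.
    import time
    import requests

    ENDPOINT_ID = "your-endpoint-id"   # placeholder
    API_KEY = "your-runpod-api-key"    # placeholder
    BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
    HEADERS = {"Authorization": f"Bearer {API_KEY}"}

    def watch(job_id, max_minutes=10):
        deadline = time.time() + max_minutes * 60
        while time.time() < deadline:
            status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS).json()
            if status.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
                return status
            time.sleep(15)
        # Still not finished: assume the worker is stuck in a retry loop and cancel the job.
        requests.post(f"{BASE}/cancel/{job_id}", headers=HEADERS)
        return {"status": "CANCELLED_BY_WATCHDOG"}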

alpayariyak commented 9 months ago

@bartlettD Could you provide an example model and GPU model please?

preemware commented 8 months ago

Same problem. Entire balance was wiped from this error:

2024-02-09 21:00:26 [5hhq44ockiqu67] [info]
  File "/vllm-installation/vllm/transformers_utils/config.py", line 23, in get_config
    config = AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1100, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 634, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 689, in _get_config_dict
    resolved_config_file = cached_file(
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 356, in cached_file
    raise EnvironmentError(
OSError: /models/huggingface-cache/hub/models--anthonylx--Proximus-2x7B-v1/snapshots/43bf1965176b15634df97107863d4e3972eecebb does not appear to have a file named config.json. Checkout 'https://huggingface.co//models/huggingface-cache/hub/models--anthonylx--Proximus-2x7B-v1/snapshots/43bf1965176b15634df97107863d4e3972eecebb/None' for available files.

The image was built with:

  docker build -t anthony001/proximus-worker:1.0.0 --build-arg MODEL_NAME="anthonylx/Proximus-2x7B-v1" --build-arg BASE_PATH="/models" .

preemware commented 8 months ago

This is why we abandoned the serverless VLLM worker. We are now using a custom TGI serverless worker that hasn't experienced this issue.

Link? Because I've lost a lot of money from trying to use this one.

alpayariyak commented 8 months ago

Same problem. Entire balance was wiped from this error: [error log and build command quoted above]

Like @ashleykleynhans said, this is a problem with RunPod Serverless in general, not something specific to worker-vllm - the team is working on a solution.

It also looks like your endpoint was never working in the first place, so in the future I'd recommend verifying that with at least one test request before leaving an endpoint running, to avoid getting your balance wiped. vLLM is faster than TGI but has a lot of moving parts, so you need to confirm that your deployment is successful, tweaking your configuration as necessary or reporting the issue if it's a bug in the worker.
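
In practice that smoke test can be a single synchronous request against the endpoint before any real traffic hits it. A sketch using the documented runsync route; the input schema shown (prompt plus sampling_params) is what I understand worker-vllm expects, but verify it against the worker README:

    # Hedged sketch: one synchronous smoke-test request before trusting a new endpoint.
    import requests

    ENDPOINT_ID = "your-endpoint-id"   # placeholder
    API_KEY = "your-runpod-api-key"    # placeholder

    resp = requests.post(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"input": {"prompt": "Say hello in one word.",
                        "sampling_params": {"max_tokens": 8}}},
        timeout=300,
    )
    resp.raise_for_status()
    print(resp.json())  # expect a COMPLETED status and some generated text before scaling up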

preemware commented 8 months ago

Like @ashleykleynhans said, this is a problem with RunPod Serverless in general, not something specific to worker-vllm - the team is working on a solution. [...]

It should exit on an exception; that isn't impossible to implement. This used to work perfectly for a long time when only using vLLM's generate. The code should be tested before being tagged as a release.

alpayariyak commented 8 months ago

@anthonyllx The issue is that Serverless will keep restarting the worker despite it breaking or raising an exception. The same would happen even when only using vLLM's generate, since you need to start the vLLM engine to use generate, which is where the exception occurs.

The latest commit fixes the error you're facing, thank you for reporting it.

alpayariyak commented 8 months ago

We will be adding a maximum number of worker restarts and job length limits to RunPod Serverless next week; this should solve the issue.

preemware commented 8 months ago

We will be adding a maximum number of worker restarts and job length limits to RunPod Serverless next week; this should solve the issue.

Thank you. This would solve the problem.

willsamu commented 8 months ago

We will be adding a maximum number of worker restarts and job length limits to RunPod Serverless next week; this should solve the issue.

@alpayariyak When will this be introduced? I cannot find a setting to configure it in the UI. I'm somewhat afraid to use serverless endpoints in prod scenarios until this is solved.

avacaondata commented 7 months ago

@gabewillen Could you please provide a link to the repo implementing the TGI custom worker?

dpkirchner commented 6 months ago

@alpayariyak Just checking to see if this feature is now available and if so how to enable it? Is it an environment variable?

DireLines commented 6 months ago

The cause of this has been identified and we are implementing a fix, which should be out by the end of next week. For now, you should know that this error will always happen when the handler code exits before reaching runpod.serverless.start(handler), which in turn mostly happens because of an error in the initialization phase. For example, in the stack trace you posted, @preemware, the error happened during initialization of the vLLM engine because of a missing config on the model.

The fix is for RunPod's backend to monitor the handler process and terminate the pod when that process exits, whether it completes successfully or not.
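
Until that backend fix lands, one worker-side mitigation (a hedged sketch, not the actual worker-vllm code) is to make sure the process always reaches runpod.serverless.start by deferring anything that can throw into the handler, so a broken initialization surfaces as a failed job rather than a pre-start crash that keeps getting retried:

    # Hedged sketch: lazy engine initialization so the process reaches
    # runpod.serverless.start() even if the vLLM engine cannot be built.
    import runpod

    engine = None
    init_error = None

    def get_engine():
        global engine, init_error
        if engine is None and init_error is None:
            try:
                from vllm import AsyncEngineArgs, AsyncLLMEngine
                engine = AsyncLLMEngine.from_engine_args(
                    AsyncEngineArgs(model="your-model-here")  # placeholder model
                )
            except Exception as exc:  # e.g. missing config.json, OOM at startup
                init_error = str(exc)
        return engine

    def handler(job):
        if get_engine() is None:
            # Return an error payload so the job fails cleanly instead of killing the process.
            return {"error": f"engine failed to initialize: {init_error}"}
        # ... run generation with the engine and return the output ...
        return {"output": "..."}

    runpod.serverless.start({"handler": handler})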

willsamu commented 5 months ago

@DireLines Thank you for the update. Is it implemented now? And how does this work together with Flashboot enabled? For example, a Mixtral finetune of mine ran fine on an RTX 6000 for dozens of requests until suddenly, during initialization with Flashboot, it threw an out-of-memory error (due to the KV cache filling up, if I remember correctly).

Does that mean we need to wrap the vLLM initialization phase in a try-catch block and continue successfully, so that it only fails once it reaches the handler?

7flash commented 4 months ago

I also have this issue; my balance was wiped out.

@dannysemi how did you implement the health check?

Permafacture commented 4 months ago

@DireLines any update?

DireLines commented 4 months ago

It took longer than expected, but the logic that flags workers failing during initialization as unhealthy is done and will be activated in the next release of one of our repos. It's already deployed, but for now it only logs to us when this happens, so we can confirm it behaves as expected before flipping the switch.

Once released, workers that are flagged in this way will be shown as "unhealthy" on the serverless UI, and automatically stopped and then removed from the endpoint. New ones will scale up to take their place, which means the money drain is slowed but not stopped. This is because a failure during initialization can happen because of a temporary outage for a dependency needed at import time as well, and we don't want a temporary outage to turn into a permanent one. In a later iteration, we will implement better retry logic so that the money drain will be stopped completely, and figure out some alerting/notification so you as the maintainer of an endpoint can know when failures of this type happen.

Thanks for your patience; this is definitely bad behavior for serverless to exhibit and not at all the intended UX. I hope this prevents problems like the ones you've experienced in the future.

DireLines commented 4 months ago

This change is now released.