gabewillen opened this issue 10 months ago
Are you still facing this issue?
I had this issue yesterday. Used up all of my credits overnight.
I've seen this as well, but more from the perspective that if vLLM runs into an error, the worker just keeps retrying the job over and over.
I can get this to happen if I do the following
This is not a vLLM-specific thing; it happens when my other workers hit errors too. They just keep running over and over and spawning more and more workers until you scale your workers down to zero. This seems to be some kind of issue with the backend or the RunPod SDK.
This is why we abandoned the serverless VLLM worker. We are now using a custom TGI serverless worker that hasn't experienced this issue.
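(For anyone wondering what a custom TGI serverless worker can look like: the sketch below is not the worker referenced above, just a hedged illustration of the general shape, assuming a TGI server is launched separately inside the same container and that the port and input schema shown here match whatever you pass to the launcher.)

import requests
import runpod

TGI_URL = "http://localhost:8080/generate"  # assumed port; depends on how TGI is launched

def handler(job):
    # Forward the serverless job input to TGI's /generate route.
    payload = {
        "inputs": job["input"]["prompt"],
        "parameters": {"max_new_tokens": job["input"].get("max_new_tokens", 128)},
    }
    resp = requests.post(TGI_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return {"text": resp.json()["generated_text"]}

runpod.serverless.start({"handler": handler})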
I'm going to try polling the health check for retries and cancel the job if I get more than one or two retries.
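Roughly what that could look like, as a sketch against the public serverless endpoint routes (/health, /status/{id}, /cancel/{id}); the exact field names in the /health response are an assumption here and may differ:

import time
import requests

API_KEY = "..."        # your RunPod API key
ENDPOINT_ID = "..."    # your serverless endpoint id
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def cancel_if_retrying(job_id, max_retries=2, poll_seconds=10):
    """Poll the endpoint health stats and cancel the job if retries pile up."""
    baseline = requests.get(f"{BASE}/health", headers=HEADERS).json()
    retried_at_start = baseline.get("jobs", {}).get("retried", 0)
    while True:
        status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS).json()
        if status.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
            return status
        health = requests.get(f"{BASE}/health", headers=HEADERS).json()
        retried = health.get("jobs", {}).get("retried", 0) - retried_at_start
        if retried > max_retries:
            # Stop paying for a job that keeps getting retried.
            requests.post(f"{BASE}/cancel/{job_id}", headers=HEADERS)
            return {"status": "CANCELLED", "reason": f"{retried} retries observed"}
        time.sleep(poll_seconds)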
@bartlettD Could you provide an example model and GPU model please?
Same problem. Entire balance was wiped from
2024-02-09 21:00:26.435 [5hhq44ockiqu67] [info] OSError: /models/huggingface-cache/hub/models--anthonylx--Proximus-2x7B-v1/snapshots/43bf1965176b15634df97107863d4e3972eecebb does not appear to have a file named config.json. Checkout 'https://huggingface.co//models/huggingface-cache/hub/models--anthonylx--Proximus-2x7B-v1/snapshots/43bf1965176b15634df97107863d4e3972eecebb/None' for available files.
2024-02-09 21:00:26.435 [5hhq44ockiqu67] [info] raise EnvironmentError(
2024-02-09 21:00:26.435 [5hhq44ockiqu67] [info] File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 356, in cached_file
2024-02-09 21:00:26.435 [5hhq44ockiqu67] [info] resolved_config_file = cached_file(
2024-02-09 21:00:26.435 [5hhq44ockiqu67] [info] File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 689, in _get_config_dict
2024-02-09 21:00:26.435 [5hhq44ockiqu67] [info] config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
2024-02-09 21:00:26.435 [5hhq44ockiqu67] [info] File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 634, in get_config_dict
2024-02-09 21:00:26.435 [5hhq44ockiqu67] [info] config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
2024-02-09 21:00:26.435 [5hhq44ockiqu67] [info] File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1100, in from_pretrained
2024-02-09 21:00:26.435 [5hhq44ockiqu67] [info] config = AutoConfig.from_pretrained(
2024-02-09 21:00:26.435 [5hhq44ockiqu67] [info] File "/vllm-installation/vllm/transformers_utils/config.py", line 23, in get_config
2024-02-09 21:00:26.434
using the build command docker build -t anthony001/proximus-worker:1.0.0 --build-arg MODEL_NAME="anthonylx/Proximus-2x7B-v1" --build-arg BASE_PATH="/models" .
This is why we abandoned the serverless VLLM worker. We are now using a custom TGI serverless worker that hasn't experienced this issue.
Link? Because I've lost a lot of money from trying to use this one.
Same problem. Entire balance was wiped from [...]
Like @ashleykleynhans said, this is a problem with RunPod Serverless in general, not something specific to worker-vllm - the team is working on a solution.
It seems like your endpoint was not working from the start, so in the future I'd recommend verifying that with at least one test request before leaving it running, to avoid getting your balance wiped. vLLM is faster than TGI but has a lot of moving parts, so you need to make sure your deployment is successful, tweaking your configuration as necessary or reporting the issue if it's a bug in the worker.
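A quick way to make that single test request is the synchronous /runsync route; this is only a sketch, and the {"input": {"prompt": ...}} schema shown here is the worker-vllm style, so adjust it for your worker:

import requests

API_KEY = "..."
ENDPOINT_ID = "..."

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Say hello in one sentence."}},
    timeout=600,
)
resp.raise_for_status()
result = resp.json()
print(result.get("status"), result.get("output"))
# Only leave the endpoint running unattended once this returns COMPLETED with output.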
It should exit on exception. That isn't impossible to implement. This used to work perfectly for a long time when only using vLLM's generate. The code should be tested before being tagged as a release.
@anthonyllx The issue is that Serverless will keep restarting the worker despite it breaking or raising an exception. The same would happen even when only using vLLM's generate, since you need to start the vLLM engine to use generate, which is where the exception occurs.
The latest commit fixes the error you're facing, thank you for reporting it.
We will be adding a maximum number of worker restarts and job length limits to RunPod Serverless next week, this should solve the issue.
Thank you. This would solve the problem.
@alpayariyak When will this be introduced? I cannot find a setting to configure it in the UI. I'm somewhat afraid to use serverless endpoints in prod scenarios until this is solved.
@gabewillen Could you please provide a link to the repo implementing the TGI custom worker?
@alpayariyak Just checking to see if this feature is now available and if so how to enable it? Is it an environment variable?
The cause of this has been identified and we are implementing a fix, which should be out by the end of next week. For now, you should know that this error will always happen when the handler code exits before running runpod.serverless.start(handler), which in turn mostly happens because of an error in the initialization phase. For example, in the stack trace you posted, @preemware, the error happened during initialization of the vLLM engine because of a missing config on the model.
The fix is for RunPod's backend to monitor the handler process and terminate the pod when that process exits, whether it completed successfully or not.
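In code terms, the failure mode looks roughly like the simplified sketch below (not the actual worker-vllm source; the model name is just the one from this thread): a module-level engine initialization raises before the SDK's job loop ever starts.

import runpod
from vllm import LLM

# If this line raises (e.g. the cached snapshot has no config.json, as in the
# stack trace above), the process dies here and runpod.serverless.start()
# below is never reached, so the platform just keeps restarting the worker.
llm = LLM(model="anthonylx/Proximus-2x7B-v1")

def handler(job):
    outputs = llm.generate([job["input"]["prompt"]])
    return {"text": outputs[0].outputs[0].text}

runpod.serverless.start({"handler": handler})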
@DireLines Thank you for the update. Is it implemented now? How does this work together with FlashBoot enabled? For example, for me a Mixtral finetune ran just fine on an RTX 6000 for dozens of requests, until suddenly during initialization with FlashBoot it threw an out-of-memory error (due to the KV cache filling up, if I remember correctly).
Does that mean we need to wrap the vLLM initialization phase in a try/except block and continue successfully, so that it will only fail once it reaches the handler?
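Something along those lines could look like the sketch below. It is an illustration under assumptions, not the actual worker code: the model name is taken from this thread, and returning a dict with an "error" key is what should make the SDK report the job as failed rather than crash-looping the worker.

import runpod

llm = None
init_error = None

def get_engine():
    """Lazily create the vLLM engine so an init failure surfaces in the handler."""
    global llm, init_error
    if llm is None and init_error is None:
        try:
            from vllm import LLM  # heavy import and engine start kept out of module load
            llm = LLM(model="anthonylx/Proximus-2x7B-v1")  # model from this thread
        except Exception as e:  # e.g. missing config.json in the cached snapshot
            init_error = str(e)
    return llm

def handler(job):
    engine = get_engine()
    if engine is None:
        # Report the failure as a job error; the process still reached
        # runpod.serverless.start(), so it is not restarted indefinitely.
        return {"error": f"engine init failed: {init_error}"}
    outputs = engine.generate([job["input"]["prompt"]])
    return {"text": outputs[0].outputs[0].text}

runpod.serverless.start({"handler": handler})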
I also have this issue; my balance was wiped out.
@dannysemi how did you implement the health check?
@DireLines any update?
It took longer than expected, but the logic that flags workers failing during initialization as unhealthy is done and will be activated in the next release of one of our repos. It's already deployed, but for now it only logs to us when it happens, so we can verify that it behaves as expected before flipping the switch.
Once released, workers that are flagged in this way will be shown as "unhealthy" on the serverless UI, and automatically stopped and then removed from the endpoint. New ones will scale up to take their place, which means the money drain is slowed but not stopped. This is because a failure during initialization can happen because of a temporary outage for a dependency needed at import time as well, and we don't want a temporary outage to turn into a permanent one. In a later iteration, we will implement better retry logic so that the money drain will be stopped completely, and figure out some alerting/notification so you as the maintainer of an endpoint can know when failures of this type happen.
Thanks for your patience, this is definitely a bad behavior for serverless to exhibit and not at all an intended UX. I hope this prevents similar problems to what you've experienced in the future.
This change is now released
Any error caused by the payload leaves the instance hanging in an error state indefinitely. You have to terminate the instance manually, or you'll rack up a hefty bill if you have several running that have hit an error.