@harshita-meena `maxRetryTimeoutInSec` is the duration for which TorchServe will keep attempting to restart a failed worker. The delay between restart attempts made within `maxRetryTimeoutInSec` follows a backoff schedule (in seconds) specified here:
https://github.com/pytorch/serve/blob/13d092c002114e5f28d92ac8ad4f21a1a56f2f1a/frontend/server/src/main/java/org/pytorch/serve/wlm/WorkerThread.java#L51-L52
Although `maxRetryTimeoutInSec` is set to 100 seconds, multiple attempts to restart the worker may still be made within that duration.
While debugging your handler, if you'd like the worker to be started only once and to give up on failure, you can set `maxRetryTimeoutInSec` to 0. This way, no retry attempts will be made. Note that this is not a recommended configuration for a production setting.
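For reference, a minimal `model-config.yaml` sketch that disables restart retries while debugging; only `maxRetryTimeoutInSec` is the setting discussed above, the remaining values are illustrative:

```yaml
# model-config.yaml (illustrative debug configuration)
minWorkers: 1
maxWorkers: 1
batchSize: 1
maxBatchDelay: 100
# 0 means the worker is started once and TorchServe gives up on failure
# (no restart retries). Useful while debugging a handler, not recommended for production.
maxRetryTimeoutInSec: 0
```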
@namannandan Thank you! I tried setting the value to 0 and I see it did not restart the model. The only confusing thing is that when I query the model, I do not see this value at that endpoint, even though it is part of the model config.
```
curl http://localhost:81/models/toy-ranker
[
  {
    "modelName": "toy-ranker",
    "modelVersion": "2024-03-21-15:57",
    "modelUrl": "toy-ranker.mar",
    "runtime": "python",
    "minWorkers": 4,
    "maxWorkers": 4,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "loadedAtStartup": true,
    ......
```
But this resolves my question, and it makes sense not to use this in prod.
@namannandan What would be your recommendation for when to restart a worker? For example, a very basic error that is just related to syntax (e.g. a missed import of the base handler, `from ts.torch_handler.base_handler import BaseHandler`) might not need a restart, but I can see scenarios where a worker crashed for an unknown reason and a restart would lead to a healthy worker. Is there a way to control when to restart?
> @namannandan Thank you! I tried setting the value to 0 and I see it did not restart the model. The only confusing thing is that when I query the model, I do not see this value at that endpoint, even though it is part of the model config.
> `curl http://localhost:81/models/toy-ranker [ { "modelName": "toy-ranker", "modelVersion": "2024-03-21-15:57", "modelUrl": "toy-ranker.mar", "runtime": "python", "minWorkers": 4, "maxWorkers": 4, "batchSize": 1, "maxBatchDelay": 100, "loadedAtStartup": true, ......`
> But this resolves my question, and it makes sense not to use this in prod.
Some of the model configuration options are not shown in the describe model API response, for example `maxRetryTimeoutInSec`, but I'd like to confirm that including a setting such as `maxRetryTimeoutInSec: 100` in your `model-config.yaml` will apply it to the model. I created a follow-up issue to track updating the describe model API response to include all model configuration values: https://github.com/pytorch/serve/issues/3037
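To illustrate, here's a sketch of what the `model-config.yaml` for the model above might look like with the retry timeout included; the worker and batching values mirror the describe output shown earlier, and the file layout itself is illustrative:

```yaml
# model-config.yaml for toy-ranker (sketch; values taken from the describe output above)
minWorkers: 4
maxWorkers: 4
batchSize: 1
maxBatchDelay: 100
# Applied by TorchServe even though it is not echoed back by the
# describe model API (tracked in the issue linked above)
maxRetryTimeoutInSec: 100
```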
> @namannandan What would be your recommendation for when to restart a worker? For example, a very basic error that is just related to syntax (e.g. a missed import of the base handler, `from ts.torch_handler.base_handler import BaseHandler`) might not need a restart, but I can see scenarios where a worker crashed for an unknown reason and a restart would lead to a healthy worker. Is there a way to control when to restart?
Although we could filter on specific errors and skip the worker restart for those, I believe it would be challenging to come up with a comprehensive list of errors that decides which ones should restart the worker and which ones should be ignored.
To keep the implementation as simple as it is today, I'd rather address the core issue here, which is, as you pointed out, that on the first worker failure the entire traceback is printed, whereas for subsequent retries the full traceback does not show up in the logs, making it difficult to find the actual error. Here's a potential fix that logs the entire traceback on worker retries: https://github.com/pytorch/serve/pull/3036
Thank you so much for creating the issues, @namannandan! I understand the complexities of customizing error retries, and I agree that solving the core issue with the traceback should be good enough for debugging purposes. I will close this issue; thank you again for your prompt replies.
🐛 Describe the bug
I am not able to find a way to disable or reduce the timeout for retries. I tried setting `maxRetryTimeoutInSec` to a lower value of 100 seconds (down from 5 minutes), as mentioned here, in the config yaml for my model.
The curl command does not show this value along with the min and max workers.
Error logs
No error logs; I just want to reduce the time for model reload.
Installation instructions
N/A
Model Packaging
N/A
config.properties
No response
Versions
```
Environment headers
Torchserve branch:

torchserve==0.9.0
torch-model-archiver==0.9.0b20240221

Python version: 3.9 (64-bit runtime)
Python executable: /Users/hmeena/development/poc/venv/bin/python

Versions of relevant python libraries:
captum==0.6.0
numpy==1.24.3
numpyencoder==0.3.0
pillow==10.2.0
psutil==5.9.5
pygit2==1.13.3
pylint==3.0.3
pytest==7.3.1
pytest-cov==4.1.0
pytest-mock==3.12.0
requests==2.31.0
requests-oauthlib==1.3.1
requests-toolbelt==1.0.0
torch==2.2.0
torch-model-archiver==0.9.0b20240221
torch-model-archiver-nightly==2023.11.21
torch-workflow-archiver==0.2.11b20240221
torch-workflow-archiver-nightly==2023.11.21
torchaudio==2.2.0
torchdata==0.7.1
torchpippy==0.1.1
torchserve==0.9.0
torchserve-nightly==2023.11.21
torchtext==0.17.0
torchvision==0.17.0
transformers==4.38.0
wheel==0.42.0

torch==2.2.0
torchtext==0.17.0
torchvision==0.17.0
torchaudio==2.2.0

Java Version:

OS: Mac OSX 11.7.8 (x86_64)
GCC version: N/A
Clang version: 12.0.0 (clang-1200.0.32.29)
CMake version: version 3.23.2

Versions of npm installed packages:
Warning: newman, newman-reporter-html markdown-link-check not installed...
```
Repro instructions
Set `maxRetryTimeoutInSec: 100` in the model config; the value does not show up in the model info returned by the describe model API.
Possible Solution
Retry the worker/model reload only for specific failures; right now the model reloads even on syntax failures, so a deployment looks ready instead of surfacing the failure. Also, finding the actual reason for the failure gets tricky, since you need to scroll back in the logs to the first failure.