pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Reduce or remove worker retries for specific failures #3034

Closed harshita-meena closed 7 months ago

harshita-meena commented 7 months ago

🐛 Describe the bug

I am not able to find a way to disable or reduce the timeout for worker retries. I tried setting maxRetryTimeoutInSec to a lower value of 100 seconds (down from 5 minutes), as mentioned here, in the config yaml for my model.

handler:
  schema_file: schema.json
  warmup: warmup.parquet
maxRetryTimeoutInSec: 100
model_name: toy-ranker
version: 2024-03-20-18:11
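
For reference, a model config like the one above is typically attached when the model is archived. A rough sketch (the serialized-file and handler file names here are placeholders, not taken from this report):

torch-model-archiver \
  --model-name toy-ranker \
  --version 2024-03-20-18:11 \
  --serialized-file model.pt \
  --handler handler.py \
  --config-file model-config.yaml \
  --export-path model_store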

The describe-model curl command does not show this value along with the min and max workers.


$ curl http://localhost:81/models/toy-ranker
[
  {
    "modelName": "toy-ranker",
    "modelVersion": "2024-03-20-18:11",
    "modelUrl": "toy-ranker.mar",
    "runtime": "python",
    "minWorkers": 4,
    "maxWorkers": 4,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "loadedAtStartup": true,
    "workers": [
      {
        "id": "9000",
        "startTime": "2024-03-20T23:12:05.221Z",
        "status": "READY",
        "memoryUsage": 289484800,
        "pid": 39,
        "gpu": false,
        "gpuUsage": "N/A"
      },
      {
        "id": "9001",
        "startTime": "2024-03-20T23:12:05.223Z",
        "status": "READY",
        "memoryUsage": 288542720,
        "pid": 42,
        "gpu": false,
        "gpuUsage": "N/A"
      },
      {
        "id": "9002",
        "startTime": "2024-03-20T23:12:05.224Z",
        "status": "READY",
        "memoryUsage": 289349632,
        "pid": 41,
        "gpu": false,
        "gpuUsage": "N/A"
      },
      {
        "id": "9003",
        "startTime": "2024-03-20T23:12:05.224Z",
        "status": "READY",
        "memoryUsage": 288591872,
        "pid": 40,
        "gpu": false,
        "gpuUsage": "N/A"
      }
    ],
    "jobQueueStatus": {
      "remainingCapacity": 1000,
      "pendingRequests": 0
    }
  }
]

Error logs

No error logs, as I just want to reduce the time spent retrying the model reload.

Installation instructions

N/A

Model Packaging

N/A

config.properties

No response

Versions


Environment headers

Torchserve branch:

torchserve==0.9.0 torch-model-archiver==0.9.0b20240221

Python version: 3.9 (64-bit runtime)
Python executable: /Users/hmeena/development/poc/venv/bin/python

Versions of relevant python libraries: captum==0.6.0 numpy==1.24.3 numpyencoder==0.3.0 pillow==10.2.0 psutil==5.9.5 pygit2==1.13.3 pylint==3.0.3 pytest==7.3.1 pytest-cov==4.1.0 pytest-mock==3.12.0 requests==2.31.0 requests-oauthlib==1.3.1 requests-toolbelt==1.0.0 torch==2.2.0 torch-model-archiver==0.9.0b20240221 torch-model-archiver-nightly==2023.11.21 torch-workflow-archiver==0.2.11b20240221 torch-workflow-archiver-nightly==2023.11.21 torchaudio==2.2.0 torchdata==0.7.1 torchpippy==0.1.1 torchserve==0.9.0 torchserve-nightly==2023.11.21 torchtext==0.17.0 torchvision==0.17.0 transformers==4.38.0 wheel==0.42.0 torch==2.2.0 torchtext==0.17.0 torchvision==0.17.0 torchaudio==2.2.0

Java Version:

OS: Mac OSX 11.7.8 (x86_64)
GCC version: N/A
Clang version: 12.0.0 (clang-1200.0.32.29)
CMake version: version 3.23.2

Versions of npm installed packages: Warning: newman, newman-reporter-html, markdown-link-check not installed...

Repro instructions

Set maxRetryTimeoutInSec: 100 in the model config; the value does not show up in the model info.

Possible Solution

Retry the worker/model reload only for specific failures; right now the model reloads even on syntax failures, so a deployment looks ready instead of surfacing the failure. Also, finding the actual reason for a failure requires scrolling back to the first failed attempt in the logs, which got tricky.

namannandan commented 7 months ago

@harshita-meena maxRetryTimeoutInSec is the total duration over which restart attempts will be made for a failed worker. The delay between restart attempts within maxRetryTimeoutInSec follows a backoff (in seconds) and is specified here: https://github.com/pytorch/serve/blob/13d092c002114e5f28d92ac8ad4f21a1a56f2f1a/frontend/server/src/main/java/org/pytorch/serve/wlm/WorkerThread.java#L51-L52

Although maxRetryTimeoutInSec is set to 100 seconds, multiple restart attempts may still be made within that duration.

While debugging your handler, if you'd like the worker start to be attempted only once and then give up on failure, you can set maxRetryTimeoutInSec to 0. This way, no retry attempts will be made. Note that this is not a recommended configuration for a production setting.
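
For example, a minimal model-config.yaml for local debugging could look like the sketch below (it keeps the handler settings from your example; the worker counts are arbitrary, and maxRetryTimeoutInSec: 0 simply disables retries as described above):

# model-config.yaml (debugging only, not recommended for production)
minWorkers: 1
maxWorkers: 1
maxRetryTimeoutInSec: 0   # give up after the first failed worker start
handler:
  schema_file: schema.json
  warmup: warmup.parquet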

harshita-meena commented 7 months ago

@namannandan Thank you! I tried setting the value to 0 and I see that it did not restart the model. The only confusing thing is that when I query the model, I do not see the setting at that endpoint, even though it is part of the model config.

curl http://localhost:81/models/toy-ranker
[
  {
    "modelName": "toy-ranker",
    "modelVersion": "2024-03-21-15:57",
    "modelUrl": "toy-ranker.mar",
    "runtime": "python",
    "minWorkers": 4,
    "maxWorkers": 4,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "loadedAtStartup": true,
......

But this resolves my question, and it makes sense not to use it in prod.

harshita-meena commented 7 months ago

@namannandan What would be your recommendation for specific cases of when to restart the worker? For example, a very basic error that is just related to syntax (e.g. a missed import of the base handler, from ts.torch_handler.base_handler import BaseHandler) might not need a restart, but I can see scenarios where a worker crashed for an unknown reason and a restart will lead to a healthy worker. Is there a way to control when to restart?

namannandan commented 7 months ago

Some of the model configuration options are not shown in the describe model API response, for example maxRetryTimeoutInSec, but I'd like to confirm that including the configuration, say maxRetryTimeoutInSec: 100, in your model-config.yaml will apply it to the model. I created a follow-up issue to track updating the describe model API response to include all model configuration values: https://github.com/pytorch/serve/issues/3037

namannandan commented 7 months ago

Although we could filter on specific errors to ignore a worker restart, I believe it may be challenging to come up with a comprehensive list of errors to decide which ones to restart the worker on and which ones to ignore.

To keep the implementation as simple as it currently is and still address the core issue here: as you pointed out, on the first worker failure the entire traceback is printed out, whereas for subsequent retries the full traceback does not show up in the logs, making it difficult to find the actual error. Here's a potential fix that logs the entire traceback on worker retries: https://github.com/pytorch/serve/pull/3036

harshita-meena commented 7 months ago

Thank you so much for creating the issues, @namannandan! I understand the complexities of customizing retries per error type, and I agree that solving the core issue of the traceback should be good enough for debugging purposes. I will close this issue; thank you again for your prompt replies.