pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Worker thread stuck in die state #1815

Closed hgong-snap closed 1 year ago

hgong-snap commented 2 years ago

🐛 Describe the bug

We encountered similar problems as https://github.com/pytorch/serve/issues/1531 and it happens quite often.

See the logs below. We have two workers (9000 and 9001) for this model. After worker 9000 hit an exception, it got stuck in an unknown state: it did not terminate itself, so no replacement worker could be added automatically, but at the same time it no longer receives incoming traffic, which essentially means we only have one worker (9001) now.

The problem is that this worker is stuck: it does not destruct itself (successfully) and it cannot receive any traffic, yet it is still counted as an active worker, so TorchServe will not add more workers (because the current worker count is 2). Normally the worker would die and TorchServe would retry it (e.g. we would find Retry worker: 9001 in 1 seconds. in the log).

If I curl the management API, it still shows that both workers are healthy.

Error logs

Worker 9000 died because of an exception. It did not produce any log after 2022-08-25 21:21:14.056 PDT. Selected logs:

[WARN ] W-9000-easyocr_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker thread exception.

io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
        at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
        at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
        at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) ~[model-server.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]

[INFO ] W-9000-easyocr_1.0-stdout MODEL_LOG - Frontend disconnected.
[INFO ] W-9000-easyocr_1.0 ACCESS_LOG - /127.0.0.1:40592 "gRPC org.pytorch.serve.grpc.inference.InferenceAPIsService/Predictions HTTP/2.0" 13 109
[INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED

[INFO ] W-9000-easyocr_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-easyocr_1.0-stderr

(Two screenshots attached.)

Installation instructions

N/A unrelated

Model Packaging

N/A unrelated

config.properties

No response

Versions

Used the 0.6.0-gpu Docker image.

Repro instructions

N/A

Possible Solution

No response

hgong-snap commented 2 years ago

Another example of a similar thing: worker W-9000-easyocr died, and GPU usage dropped to almost half (2 workers; 1 worker died and got stuck, so only 1 worker is functioning).

If I ssh into the pod and curl the management API for the model:

root@ad-ranker-pytorch-ranker-usc-a-588566f68-dt8lj:/home/model-server# curl localhost:17201/models/easyocr
[
  {
    "modelName": "easyocr",
    "modelVersion": "1.0",
    "modelUrl": "easyocr.mar",
    "runtime": "python",
    "minWorkers": 2,
    "maxWorkers": 2,
    "batchSize": 5,
    "maxBatchDelay": 25,
    "loadedAtStartup": true,
    "workers": [
      {
        "id": "9000",
        "startTime": "2022-08-26T06:26:26.166Z",
        "status": "READY",
        "memoryUsage": 2979250176,
        "pid": 1042,
        "gpu": true,
        "gpuUsage": "gpuId::0 utilization.gpu [%]::63 % utilization.memory [%]::20 % memory.used [MiB]::5755 MiB"
      },
      {
        "id": "9001",
        "startTime": "2022-08-26T06:26:26.168Z",
        "status": "READY",
        "memoryUsage": 2992365568,
        "pid": 1043,
        "gpu": true,
        "gpuUsage": "gpuId::0 utilization.gpu [%]::63 % utilization.memory [%]::20 % memory.used [MiB]::5755 MiB"
      }
    ]
  }
]
(Two screenshots attached.)
hgong-snap commented 2 years ago

With that said,

  1. I can't tell whether a worker is in this stuck "dead" state or not.
  2. Even if I know a worker has died, I can't tell which one died, nor kill/restart that specific worker.

So basically there's really no way to catch this stuck dead-worker error?
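
For context, polling the management API (as in the curl output above) does not surface this state either; a minimal sketch of such a check, assuming the same management address, would still print READY for the stuck worker:

import json
import urllib.request

# Query the management API shown in the curl output above
# (assumed to be listening on localhost:17201).
with urllib.request.urlopen("http://localhost:17201/models/easyocr") as resp:
    for model in json.load(resp):
        for worker in model["workers"]:
            # Even the stuck worker is reported with status READY,
            # so this check cannot detect the problem.
            print(worker["id"], worker["status"])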

hgong-snap commented 2 years ago

We have had quite a few similar issues recently, and I think in general they all seem to point to some race condition when the gRPC call gets cancelled for some reason (maybe by the client? or a timeout?), after which the corresponding worker gets stuck in a dead-ish (zombie) state.

msaroufim commented 2 years ago

Hi @hgong-snap, is there any chance you can share a repro? In particular your .mar file and config.properties - otherwise I'll see if this occurs on smaller toy models.

msaroufim commented 2 years ago

An update: I tried to repro this offline.

Setup

git clone https://github.com/pytorch/serve
mkdir model_store
torchserve --start --model-store model_store
python -m grpc_tools.protoc --proto_path=frontend/server/src/main/resources/proto/ --python_out=ts_scripts --grpc_python_out=ts_scripts frontend/server/src/main/resources/proto/inference.proto frontend/server/src/main/resources/proto/management.proto

Get patched client

Use this patched version of torchserve_grpc_client.py: https://gist.github.com/msaroufim/8a9f7e8a0a58ff5687c4266677c10fe3

Get input data

I picked one of the slowest models I know of from docs/model_zoo.md and then ran 10000 inferences in a row

sample_text.txt

Bloomberg has decided to publish a new report on global economic situation.

Run test

for i in $(seq 0 1 10000)
do 
    python ts_scripts/torchserve_grpc_client.py infer bert_seqc_without_torchscript sample_text.txt
    echo $i
done

EDIT: I tried just pressing Ctrl-C a few times during the loop, and if you look at the torchserve logs printed in the console where you ran torchserve --start, you should now see that the backend worker thread died:

2022-09-07T11:12:14,853 [WARN ] W-9000-bert_seqc_without_torchscript_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker thread exception.
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
        at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
        at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
        at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) [model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
hgong-snap commented 2 years ago

Hi @msaroufim, here are the mar file and config file:

mar file: https://drive.google.com/file/d/1GaNfIhvAZn-7hFE1BlfDyTXajrFAAWuR/view?usp=sharing
config.properties:

service_envelope=body

grpc_inference_port=17100
grpc_management_port=17101

inference_address=http://0.0.0.0:17200
management_address=http://0.0.0.0:17201
metrics_address=http://127.0.0.1:17202

default_workers_per_model=2
install_py_dep_per_model=true
load_models=all
# async_logging=true
job_queue_size=500
default_response_timeout=5
unregister_model_timeout=5
models={\
  "easyocr": {\
    "1.0": {\
      "defaultVersion": true,\
      "marName": "easyocr.mar",\
      "minWorkers": 2,\
      "maxWorkers": 2,\
      "batchSize": 5,\
      "maxBatchDelay": 25,\
      "responseTimeout": 2000\
    }\
  },\
  "suggestive_timesformer": {\
    "0.01": {\
      "defaultVersion": true,\
      "marName": "suggestive_timesformer.mar",\
      "minWorkers": 1,\
      "maxWorkers": 1,\
      "batchSize": 8,\
      "maxBatchDelay": 250,\
      "responseTimeout": 5000\
    }\
  }\
}
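
For reference, a config like this would typically be passed to TorchServe at startup with something like torchserve --start --ts-config config.properties --model-store <path-to-mar-files>; the exact launch command and paths are not shown in this thread.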
hgong-snap commented 2 years ago

For the repro, I guess the best way is to cancel the request on the client side? Or set a short enough gRPC timeout so that gRPC will cancel the request internally?
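
A minimal sketch of both ideas, reusing the generated gRPC stubs and the easyocr request shape from the repro scripts later in this thread (image.jpg and port 17100 are taken from the config above and those scripts; this is an illustrative sketch, not an official repro):

import grpc
import inference_pb2
import inference_pb2_grpc

channel = grpc.insecure_channel("localhost:17100")
stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)

with open("image.jpg", "rb") as f:
    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input={"data": f.read()})

# Option 1: cancel the in-flight call explicitly from the client side.
call = stub.Predictions.future(request)
call.cancel()

# Option 2: set a deadline far below the model latency, so gRPC cancels
# the call internally once the deadline expires.
try:
    stub.Predictions(request, timeout=0.02)
except grpc.RpcError as e:
    print(e.code())  # typically DEADLINE_EXCEEDED or CANCELLED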

msaroufim commented 2 years ago

@hgong-snap Confirming we can finally repro; we will get back to you with a solution soon.

MohamedAliRashad commented 2 years ago

I have the same problem. Any ideas?

MHatemAbdelhamid commented 2 years ago

Same issue with only 1 worker.

The model works correctly for a while, then the worker stops with the error below; after that it just keeps trying to start a new worker and failing for a while.


 2022-09-08T11:44:09,800 [INFO ] epollEventLoopGroup-5-121 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STOPPED
2022-09-08T11:44:09,800 [WARN ] W-9000-unified_salient_and_beit_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-unified_salient_and_beit_1.0-stderr
2022-09-08T11:44:09,800 [WARN ] W-9000-unified_salient_and_beit_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-unified_salient_and_beit_1.0-stdout
2022-09-08T11:44:09,800 [INFO ] W-9000-unified_salient_and_beit_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2022-09-08T11:44:09,998 [INFO ] W-9000-unified_salient_and_beit_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-unified_salient_and_beit_1.0-stderr
2022-09-08T11:44:09,998 [INFO ] W-9000-unified_salient_and_beit_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-unified_salient_and_beit_1.0-stdout
2022-09-08T11:44:10,801 [DEBUG] W-9000-unified_salient_and_beit_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/conda/bin/python3.8, /opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000]
lxning commented 2 years ago

@MHatemAbdelhamid The issue reported in this ticket is that the worker thread hangs and never gets a chance to recreate a new worker thread. Your case is different, since a new worker thread is created. In your case, most likely there is something wrong either with your model implementation or with the input data.

MohamedAliRashad commented 2 years ago

@lxning But the model does work under normal conditions; this only happens during stress testing.

MHatemAbdelhamid commented 2 years ago

@MHatemAbdelhamid The issue reported in this ticket is that the worker thread hangs and never gets a chance to recreate a new worker thread. Your case is different, since a new worker thread is created. In your case, most likely there is something wrong either with your model implementation or with the input data.

It works normally under normal conditions; the error only happens with a large number of users. When the number of users suddenly increases, it goes into an infinite loop of trying to create the worker and then failing.

lxning commented 2 years ago

@MHatemAbdelhamid According to your description ("the error only happens on large number of users, suddenly when the number of users increases"), it seems to be a capacity issue. I suggest you find the error log to identify the root cause. I believe the issue in this ticket is different from the one you are facing.

hgong-snap commented 2 years ago

Hi @lxning @msaroufim ,

Thanks for the quick fix. Unfortunately, https://github.com/pytorch/serve/pull/1854 does not seem to fully mitigate the issue.

I built the image from the latest master with ./build_image.sh -g [reference] and then deployed it. A similar error happened:

It matches well with the observed GPU memory usage:

(Screenshot attached.)
hgong-snap commented 2 years ago

We have currently implemented a workaround for GKE: if there is no worker available, just restart the pod (i.e. restart TorchServe). As you can see, this error can happen very often (8pm-11pm): the worker count goes 2 -> 1 -> 0 -> restart -> 2 -> 1 -> 0, and repeats.

(Screenshot attached.)
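
One possible shape for that "no worker available, restart the pod" check (a sketch under assumptions; the actual GKE workaround is not shown in this thread): a single-shot probe that sends a small inference with a deadline and exits non-zero when it fails, so that a Kubernetes liveness exec probe with a failure threshold restarts TorchServe once no worker responds anymore.

import sys
import grpc
import inference_pb2
import inference_pb2_grpc

def main():
    # Same stubs, model and test image as the repro scripts in this thread.
    channel = grpc.insecure_channel("localhost:17100")
    stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
    with open("image.jpg", "rb") as f:
        request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                                   input={"data": f.read()})
    try:
        # If every worker is stuck, this deadline expires and the probe fails;
        # the liveness probe's failureThreshold then decides when to restart the pod.
        stub.Predictions(request, timeout=10.0)
    except grpc.RpcError:
        sys.exit(1)

if __name__ == "__main__":
    main()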
hgong-snap commented 2 years ago

@msaroufim @lxning

I can successfully repro it locally with the following setup.

Setup

import sys
import grpc
import time
import argparse
import inference_pb2
import inference_pb2_grpc
import management_pb2
import management_pb2_grpc
from datetime import datetime
from google.protobuf import empty_pb2

parser = argparse.ArgumentParser()
parser.add_argument("--timeout", required=True)
parser.add_argument("--port", required=True)
parser.add_argument("-n", required=True)

if __name__ == '__main__':
    args, _ = parser.parse_known_args()
    port = args.port
    num_runs = int(args.n)

    # Read the test image once and build a PredictionsRequest for the easyocr model.
    with open('image.jpg', 'rb') as f:
        data = f.read()
    input_data = {'data': data}
    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input=input_data)

    # Send the requests back to back with the given client-side deadline;
    # calls that exceed the deadline are cancelled by gRPC and the error is printed.
    channel = grpc.insecure_channel(f'localhost:{port}')
    stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
    for i in range(num_runs):
        try:
            start = datetime.now()
            response = stub.Predictions(request, timeout=float(args.timeout))
            print("request time:", datetime.now() - start)
            print(response)
        except Exception as e:
            print(e)
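
From the discussion below, the script was apparently run concurrently in two terminals against the same TorchServe instance: "tab 1" with a workable deadline (--timeout 2) and "tab 2" with a deadline far below the model latency (--timeout 0.02), so that gRPC cancels most of tab 2's calls. The exact command lines are not shown in the thread.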

Result

It takes TorchServe anywhere from several seconds to several minutes to resolve the issue, after which the tab 1 simulation's output goes back to normal; but it is also likely that all workers get stuck forever and tab 1 just sees a flood of errors.

MohamedAliRashad commented 2 years ago

@hgong-snap This is precisely the problem I faced.

@lxning Any thoughts ?

lxning commented 2 years ago

@hgong-snap I tried the steps you provided. There is no exception or dead worker thread in ts_log.log.zip. The message "grpc client call already cancelled" was logged when TS was trying to send a response but the request had already timed out on the client side.

The reason the tab 1 run with timeout=2 recovers slowly is that the tab 2 run with timeout=0.02 sends requests at 100x the rate of tab 1. That means 100x as many requests from tab 2 accumulate in TorchServe's internal job queue. These jobs are still processed by TS even though the client has already cancelled the connection. This scenario gives the wrong impression that a worker dies and takes a long time to recover.

Meanwhile, I filed #1863 to optimize TorchServe performance.

hgong-snap commented 2 years ago

@lxning I updated my script to accept another parameter, --sleep_time, which configures how long the client should wait/sleep between requests. New client script:

import sys
import grpc
import time
import argparse
import inference_pb2
import inference_pb2_grpc
import management_pb2
import management_pb2_grpc
from datetime import datetime
from google.protobuf import empty_pb2

parser = argparse.ArgumentParser()
parser.add_argument("--timeout", required=True, type=float)
parser.add_argument("--port", required=True, type=int)
parser.add_argument("--sleep_time", default=0, type=float)
parser.add_argument("-n", required=True, type=int)

if __name__ == '__main__':
    args, _ = parser.parse_known_args()
    port = args.port
    num_runs = int(args.n)

    with open('image.jpg', 'rb') as f:
        data = f.read()
    input_data = {'data': data}
    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input=input_data)

    channel = grpc.insecure_channel(f'localhost:{port}')
    stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
    for i in range(num_runs):
        try:
            start = datetime.now()
            response = stub.Predictions(request, timeout=args.timeout)
            print("request time:", datetime.now() - start)
            print(response)
        except Exception as e:
            print(e)
        if args.sleep_time:
            # Throttle the request rate by sleeping between calls.
            time.sleep(args.sleep_time)
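
Presumably invoked like the earlier script, with --sleep_time throttling the request rate, e.g. python client.py --port 17100 --timeout 0.02 --sleep_time 0.1 -n 10000 (the file name and values here are illustrative, not taken from the thread).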

I recorded a video to illustrate the problem; there are some good timestamps to look at:

https://drive.google.com/file/d/1zNlFTX6h2AO_DVQHgwZmbHClKy1zcfOI/view?usp=sharing

lxning commented 2 years ago

@hgong-snap Thank you for recording the workflow. I can see the following exception in the video:

io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
        at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
        at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
        at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) [model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]

After the PR, the above exception should be gone. Instead, only the warning "grpc client call already cancelled" is shown in the log (e.g. ts_log.log.zip). Could you please check whether you see that warning in your log? If not, most likely you are still using the old source code.

Here is an example of building a Docker image based on master:

cd docker
./build_image.sh -bt dev
hgong-snap commented 2 years ago

@lxning Hmm, I'm pretty sure I built the image from the latest master though... Just in case, would you mind uploading the Docker image you use so that I can try it on my end? (ideally with GPU support)

lxning commented 2 years ago

@hgong-snap I verified the master branch on both a local host and the Docker nightly build.

You can fetch the TorchServe nightly build at https://hub.docker.com/r/pytorch/torchserve-nightly/tags.

hgong-snap commented 2 years ago

@lxning Thanks. I tried to build the dev image locally with

./build_image.sh -bt dev -g

After this completed successfully, I ran

docker run -it --rm  pytorch/torchserve:dev-gpu

but it failed with:

Traceback (most recent call last):
  File "/usr/bin/torchserve", line 33, in <module>
    sys.exit(load_entry_point('torchserve==0.6.0b20220919', 'console_scripts', 'torchserve')())
  File "/usr/bin/torchserve", line 22, in importlib_load_entry_point
    for entry_point in distribution(dist_name).entry_points
  File "/usr/lib/python3.8/importlib/metadata.py", line 503, in distribution
    return Distribution.from_name(distribution_name)
  File "/usr/lib/python3.8/importlib/metadata.py", line 177, in from_name
    raise PackageNotFoundError(name)
importlib.metadata.PackageNotFoundError: torchserve

Any hint as to why the torchserve Python package is not installed in the Docker image?

lxning commented 2 years ago

@hgong-snap The GPU Docker image build should specify a CUDA version, for example: ./build_image.sh -bt dev -g -cv cu102

Here is the detailed information about the TorchServe Docker image build.

Could you please directly pull the TorchServe Docker nightly build, which is based on cu102, to quickly verify? e.g. docker pull pytorch/torchserve-nightly:latest-gpu
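
To quickly verify against that image (assuming the NVIDIA Container Toolkit is installed), something like docker run --rm -it --gpus all pytorch/torchserve-nightly:latest-gpu should start TorchServe with its default configuration.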

hgong-snap commented 1 year ago

@lxning Verified on pytorch/torchserve-nightly:latest-gpu and it seems the issue is solved!

lxning commented 1 year ago

@hgong-snap Great, thank you for the verification.

hgong-snap commented 1 year ago

Quick question:

in order to run TorchServe with the fix locally (e.g. on a Mac) without Docker, will

pip install torchserve-nightly

be sufficient?

msaroufim commented 1 year ago

Yep. Don't forget to run ts_scripts/install_dependencies first though, with dev dependencies and GPU if needed.

arnavmehta7 commented 1 year ago

Hi, I am still getting this issue. I have built the image through build.sh