Another example of the same thing: worker W-9000-easyocr
died, and GPU usage dropped to almost half (2 workers, 1 worker died/stuck, only 1 worker is functioning).
If I SSH into the pod and curl the management API for the model:
root@ad-ranker-pytorch-ranker-usc-a-588566f68-dt8lj:/home/model-server# curl localhost:17201/models/easyocr
[
{
"modelName": "easyocr",
"modelVersion": "1.0",
"modelUrl": "easyocr.mar",
"runtime": "python",
"minWorkers": 2,
"maxWorkers": 2,
"batchSize": 5,
"maxBatchDelay": 25,
"loadedAtStartup": true,
"workers": [
{
"id": "9000",
"startTime": "2022-08-26T06:26:26.166Z",
"status": "READY",
"memoryUsage": 2979250176,
"pid": 1042,
"gpu": true,
"gpuUsage": "gpuId::0 utilization.gpu [%]::63 % utilization.memory [%]::20 % memory.used [MiB]::5755 MiB"
},
{
"id": "9001",
"startTime": "2022-08-26T06:26:26.168Z",
"status": "READY",
"memoryUsage": 2992365568,
"pid": 1043,
"gpu": true,
"gpuUsage": "gpuId::0 utilization.gpu [%]::63 % utilization.memory [%]::20 % memory.used [MiB]::5755 MiB"
}
]
}
]
With that said,
is there basically no way to catch this worker died/stuck error?
I've had quite a few similar issues recently, and in general they all seem to point to some race condition when the gRPC call gets cancelled for some reason (maybe by the client? or a timeout?), after which the corresponding worker gets stuck in a dead-ish (zombie) state.
Hi @hgong-snap, is there any chance you can share a repro? In particular your .mar file and config.properties - otherwise I'll see if this occurs on smaller toy models.
An update: I tried to repro this offline:
git clone https://github.com/pytorch/serve
mkdir model_store
torchserve --start --model-store model_store
python -m grpc_tools.protoc --proto_path=frontend/server/src/main/resources/proto/ --python_out=ts_scripts --grpc_python_out=ts_scripts frontend/server/src/main/resources/proto/inference.proto frontend/server/src/main/resources/proto/management.proto
Use this patched version of torchserve_grpc_client.py
https://gist.github.com/msaroufim/8a9f7e8a0a58ff5687c4266677c10fe3
I picked one of the slowest models I know of from docs/model_zoo.md
and then ran 10000 inferences in a row
sample_text.txt
Bloomberg has decided to publish a new report on global economic situation.
for i in $(seq 0 1 10000)
do
python ts_scripts/torchserve_grpc_client.py infer bert_seqc_without_torchscript sample_text.txt
echo $i
done
EDIT: I tried just pressing Ctrl+C a few times, and if you look at the torchserve logs printed to the console where you ran torchserve --start,
you should now see that the backend worker thread died:
2022-09-07T11:12:14,853 [WARN ] W-9000-bert_seqc_without_torchscript_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker thread exception.
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Hi @msaroufim, here are the mar file and config file:
mar file: https://drive.google.com/file/d/1GaNfIhvAZn-7hFE1BlfDyTXajrFAAWuR/view?usp=sharing config.properties:
service_envelope=body
grpc_inference_port=17100
grpc_management_port=17101
inference_address=http://0.0.0.0:17200
management_address=http://0.0.0.0:17201
metrics_address=http://127.0.0.1:17202
default_workers_per_model=2
install_py_dep_per_model=true
load_models=all
# async_logging=true
job_queue_size=500
default_response_timeout=5
unregister_model_timeout=5
models={\
"easyocr": {\
"1.0": {\
"defaultVersion": true,\
"marName": "easyocr.mar",\
"minWorkers": 2,\
"maxWorkers": 2,\
"batchSize": 5,\
"maxBatchDelay": 25,\
"responseTimeout": 2000\
}\
},\
"suggestive_timesformer": {\
"0.01": {\
"defaultVersion": true,\
"marName": "suggestive_timesformer.mar",\
"minWorkers": 1,\
"maxWorkers": 1,\
"batchSize": 8,\
"maxBatchDelay": 250,\
"responseTimeout": 5000\
}\
}\
}
For the repro, I guess the best way is to cancel the request on the client side, or set a short enough gRPC timeout so that gRPC will cancel the request internally?
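For illustration, both options look roughly like this (a minimal sketch, assuming the inference_pb2/inference_pb2_grpc stubs generated by the protoc command above; image.jpg is just an arbitrary test input):

import grpc
import inference_pb2
import inference_pb2_grpc

channel = grpc.insecure_channel("localhost:17100")
stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
with open("image.jpg", "rb") as f:
    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input={"data": f.read()})

# Option 1: cancel an in-flight call explicitly from the client side.
call = stub.Predictions.future(request)
call.cancel()

# Option 2: set a deadline so short that gRPC cancels the call internally.
try:
    stub.Predictions(request, timeout=0.02)
except grpc.RpcError as e:
    print(e.code())  # typically DEADLINE_EXCEEDED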
@hgong-snap confirming we can finally repro, will get back to you with a solution soon
I have the same problem. Any ideas?
Same issue with only 1 worker.
The model works correctly for a while, then the worker stops with the error below, and then it just keeps trying to start a new worker and failing:
2022-09-08T11:44:09,800 [INFO ] epollEventLoopGroup-5-121 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STOPPED
2022-09-08T11:44:09,800 [WARN ] W-9000-unified_salient_and_beit_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-unified_salient_and_beit_1.0-stderr
2022-09-08T11:44:09,800 [WARN ] W-9000-unified_salient_and_beit_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-unified_salient_and_beit_1.0-stdout
2022-09-08T11:44:09,800 [INFO ] W-9000-unified_salient_and_beit_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2022-09-08T11:44:09,998 [INFO ] W-9000-unified_salient_and_beit_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-unified_salient_and_beit_1.0-stderr
2022-09-08T11:44:09,998 [INFO ] W-9000-unified_salient_and_beit_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-unified_salient_and_beit_1.0-stdout
2022-09-08T11:44:10,801 [DEBUG] W-9000-unified_salient_and_beit_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/conda/bin/python3.8, /opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000]
@MHatemAbdelhamid The issue reported in this ticket is that the worker thread hangs and never gets a chance to recreate a new worker thread. Your case is different, since a new worker thread is created. In your case, most likely there is something wrong with either your model implementation or the input data.
@lxning But the model does work under normal conditions; this only happens during stress testing.
It works normally under normal conditions; the error only happens with a large number of users. When the number of users suddenly increases, it goes into an infinite loop of trying to create the worker and then failing.
@MHatemAbdelhamid According to your description ("the error only happens on large number of users, suddenly when the number of users increases"), it seems to be a capacity issue. I suggest you find the error log to identify the root cause. I believe the issue in this ticket is different from the one you are facing.
Hi @lxning @msaroufim,
thanks for the quick fix. Unfortunately, https://github.com/pytorch/serve/pull/1854 does not seem to fully mitigate the issue.
I built the image from the latest master with ./build_image.sh -g and then deployed it. A similar error happened:
[18:41:01] W-9001 throws exception:
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) [model-server.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
W-9000 was still fine at that point, but it finally died at 18:55.
This matches well with the observed GPU memory usage.
We currently implemented a workaround for GKE: if there are no workers available, we just restart the pod (i.e. restart torchserve). As you can see, this error can happen very often (8pm-11pm); the worker count goes 2 -> 1 -> 0 -> restart -> 2 -> 1 -> 0, repeat....
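The shape of that workaround is roughly the following (a minimal sketch, not our exact probe; the file name and the canned probe_image.jpg input are hypothetical). It sends a cheap test inference with a deadline and exits non-zero, so Kubernetes restarts the pod when no worker responds; checking the management API alone is not enough, since it keeps reporting the stuck workers as healthy:

# liveness_probe.py: hypothetical sketch of the "restart the pod when no
# worker responds" workaround. Exits 1 on timeout/error so the kubelet
# restarts the container (and with it torchserve).
import sys

import grpc
import inference_pb2
import inference_pb2_grpc


def worker_alive(port: int = 17100, timeout_s: float = 10.0) -> bool:
    channel = grpc.insecure_channel(f"localhost:{port}")
    stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
    # Small canned input baked into the image, used only for the probe.
    with open("probe_image.jpg", "rb") as f:
        request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                                   input={"data": f.read()})
    try:
        stub.Predictions(request, timeout=timeout_s)
        return True
    except grpc.RpcError:
        return False


if __name__ == "__main__":
    sys.exit(0 if worker_alive() else 1)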
@msaroufim @lxning
I can successfully repro it locally with the following setup:
config.properties (with min/max workers set to 2), plus this client script:
import sys
import grpc
import time
import argparse
import inference_pb2
import inference_pb2_grpc
import management_pb2
import management_pb2_grpc
from datetime import datetime
from google.protobuf import empty_pb2
parser = argparse.ArgumentParser()
parser.add_argument("--timeout", required=True)
parser.add_argument("--port", required=True)
parser.add_argument("-n", required=True)
if __name__ == '__main__':
    args, _ = parser.parse_known_args()
    port = args.port
    num_runs = int(args.n)

    with open('image.jpg', 'rb') as f:
        data = f.read()
    input_data = {'data': data}

    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input=input_data)
    channel = grpc.insecure_channel(f'localhost:{port}')
    stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)

    for i in range(num_runs):
        try:
            start = datetime.now()
            response = stub.Predictions(request, timeout=float(args.timeout))
            print("request time:", datetime.now() - start)
            print(response)
        except Exception as e:
            print(e)
Tab 1 (normal client, 2s timeout): python3 torchserve_grpc_client.py --port=17100 --timeout=2 -n=100000000
Tab 2 (bad client, 0.02s timeout): python3 torchserve_grpc_client.py --port=17100 --timeout=0.02 -n=100
Sometimes it takes torchserve several seconds to several minutes to resolve the issue, and tab 1's output goes back to normal; but it is also likely that all workers get stuck forever and tab 1 sees a flood of errors.
@hgong-snap This is precisely the problem I faced.
@lxning Any thoughts?
@hgong-snap I tried the steps you provided. There is no exception or dead worker thread in the attached ts_log.log.zip. The message "grpc client call already cancelled" was logged when TS was trying to send a response but the client side had already timed out.
The reason tab 1 (timeout=2) recovers slowly is that tab 2 (timeout=0.02) has 100x the request rate of tab 1. That means 100x as many requests from tab 2 accumulate in the TS internal job queue, and these jobs are still processed by TS even though the client has already cancelled the connection. This scenario gives you the wrong impression that the worker dies and takes a long time to recover.
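To make that queueing argument concrete, here is a rough back-of-envelope sketch (the per-request service time and rates are illustrative assumptions, not measurements):

# Illustrative job-queue backlog arithmetic (all numbers are assumptions).
workers = 2
service_time_s = 0.5      # assumed time a worker spends per request
tab2_rate_rps = 50        # tab 2 fires and immediately abandons ~50 req/s
tab1_timeout_s = 2.0

# Stale (already-cancelled) jobs that pile up ahead of one tab 1 request:
backlog = tab2_rate_rps * tab1_timeout_s              # 100 jobs
drain_time_s = backlog * service_time_s / workers     # ~25 s to drain them

print(f"stale jobs ahead of a tab 1 request: {backlog:.0f}")
print(f"time to drain them: {drain_time_s:.0f}s, far above the "
      f"{tab1_timeout_s}s client timeout, so tab 1 keeps timing out")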
Meanwhile, I filed #1863 to optimize TorchServe performance.
@lxning I updated my script to accept another parameter, --sleep_time,
so that I can configure how long the client should wait/sleep between requests.
New client script:
import sys
import grpc
import time
import argparse
import inference_pb2
import inference_pb2_grpc
import management_pb2
import management_pb2_grpc
from datetime import datetime
from google.protobuf import empty_pb2
parser = argparse.ArgumentParser()
parser.add_argument("--timeout", required=True, type=float)
parser.add_argument("--port", required=True, type=int)
parser.add_argument("--sleep_time", default=0, type=float)
parser.add_argument("-n", required=True, type=int)
if __name__ == '__main__':
    args, _ = parser.parse_known_args()
    port = args.port
    num_runs = int(args.n)

    with open('image.jpg', 'rb') as f:
        data = f.read()
    input_data = {'data': data}

    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input=input_data)
    channel = grpc.insecure_channel(f'localhost:{port}')
    stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)

    for i in range(num_runs):
        try:
            start = datetime.now()
            response = stub.Predictions(request, timeout=args.timeout)
            print("request time:", datetime.now() - start)
            print(response)
        except Exception as e:
            print(e)
        if args.sleep_time:
            time.sleep(args.sleep_time)
I recorded a video to illustrate the problem. Some good timestamps to look at:
0:10 I send bad requests in tab 2 with
python3 grpc_client/torchserve_grpc_client2.py --port=17100 --timeout=0.02 -n=10 --sleep_time=1
so that QPS is now 1.
0:16 tab 1 starts seeing worker issues, either INTERNAL or DEADLINE_EXCEEDED.
0:34 tab 1 recovers (lucky).
0:45 I send bad requests in tab 2 again.
tab 1 has never recovered since then; at the time of writing, tab 1 still has 0 successes (>15 min now).
https://drive.google.com/file/d/1zNlFTX6h2AO_DVQHgwZmbHClKy1zcfOI/view?usp=sharing
@hgong-snap Thank you for recording the workflow, I can see the following exception in the video.
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
After the PR, the above exception should be gone. Instead, only the warning "grpc client call already cancelled" is shown in the log (e.g. ts_log.log.zip). Could you please check whether you can see such a warning in your log? If not, you are most likely still using the old source code.
Here is an example of building a docker image based on master:
cd docker
./build_image.sh -bt dev
@lxning hmm, I'm pretty sure I built the image from the latest master though... but just in case, would you mind uploading the docker image you use so that I can try it on my end? (ideally with GPU support)
@hgong-snap I verified the master branch on both the local host and the docker nightly build.
You can fetch the torchserve nightly build at https://hub.docker.com/r/pytorch/torchserve-nightly/tags.
@lxning thanks. I tried to build the dev image locally with
./build_image.sh -bt dev -g
After this completed successfully, I ran
docker run -it --rm pytorch/torchserve:dev-gpu
but got this error:
Traceback (most recent call last):
File "/usr/bin/torchserve", line 33, in <module>
sys.exit(load_entry_point('torchserve==0.6.0b20220919', 'console_scripts', 'torchserve')())
File "/usr/bin/torchserve", line 22, in importlib_load_entry_point
for entry_point in distribution(dist_name).entry_points
File "/usr/lib/python3.8/importlib/metadata.py", line 503, in distribution
return Distribution.from_name(distribution_name)
File "/usr/lib/python3.8/importlib/metadata.py", line 177, in from_name
raise PackageNotFoundError(name)
importlib.metadata.PackageNotFoundError: torchserve
Any hint as to why the torchserve python package is not installed in the docker image?
@hgong-snap The GPU docker image build should specify a CUDA version, for example: ./build_image.sh -bt dev -g -cv cu102
Here is the detailed information about the torchserve docker image build.
Could you please directly pull the torchserve docker nightly build (which is based on cu102) to quickly verify? e.g. docker pull pytorch/torchserve-nightly:latest-gpu
@lxning verified on pytorch/torchserve-nightly:latest-gpu
and the issue seems to be solved!
@hgong-snap Great, thank you for the verification.
Quick question:
in order to run torchserve with the fix locally (e.g. on a Mac) without docker, will
pip install torchserve-nightly
be sufficient?
Yep, but don't forget to run ts_scripts/install_dependencies
first, with dev dependencies and GPU if needed.
Hi, I am still getting this issue. I have built the image through build.sh
🐛 Describe the bug
We encountered problems similar to https://github.com/pytorch/serve/issues/1531, and it happens quite often.
See logs below. We have two workers (9000, 9001) for this model. After worker 9000 got an exception, it is stuck in an unknown state: it didn't terminate itself, so no new worker can be added automatically, but in the meantime it won't receive incoming traffic, which essentially means we only have one worker (9001) now.
The problem is that this worker is in a stuck state: it does not (successfully) destroy itself, and it can't receive any traffic. It is still counted as one active worker, so torchserve won't add more workers (because the current worker count is 2). Normally the worker would die and torchserve would retry it, e.g. you would find
Retry worker: 9001 in 1 seconds.
in the log. If I curl the management API, it still shows that both workers are healthy.
Error logs
Worker 9000 died because of an exception. It didn't log anything after 2022-08-25 21:21:14.056 PDT. Selected logs:
[WARN ] W-9000-easyocr_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker thread exception.
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) ~[model-server.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
[INFO ] W-9000-easyocr_1.0-stdout MODEL_LOG - Frontend disconnected.
[INFO ] W-9000-easyocr_1.0 ACCESS_LOG - /127.0.0.1:40592 "gRPC org.pytorch.serve.grpc.inference.InferenceAPIsService/Predictions HTTP/2.0" 13 109
[INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED
[INFO ] W-9000-easyocr_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-easyocr_1.0-stderr
Installation instructions
N/A unrelated
Model Packaging
N/A unrelated
config.properties
No response
Versions
Used the 0.6.0-gpu docker image.
Repro instructions
N/A
Possible Solution
No response