Open · TalkUHulk opened this issue 4 years ago
@Mister5ive: Please update the context details. Also, attach the ts_log.log and access_log.log files from the logs directory.
Questions:
- When is the error observed? While scaling up the model workers or at the time of inference? This happened in the production environment.
- Does the model work fine with a single worker? Not really; the issue occurs even with 1 worker.
- Could you share a minimal reproducer? Docker image: docker pull mister5ive/torchserve:gpu-10.2-cudnn7-runtime-ubuntu18.04
Thanks for your reply.
@Mister5ive: Thanks for the updates. I will try and reproduce this error at my end and will revert.
@Mister5ive: Please share a sample input and a sample inference request command for your model. I could not reproduce this with simple curl based requests.
(echo -n '{"base64": "'; base64 /home/ubuntu/serve/examples/image_classifier/kitten.jpg; echo '"}') | curl -H "Content-Type: application/json" -d @- http://localhost:8080/predictions/SpamCls
This is my test script. I deleted the unnecessary code; you just need to change manger.py -> model_api. sample.zip
This problem was found in a large-scale, highly concurrent production environment.
Questions:
- Was there any specific reason to split the handler into two files? No, I just referred to the following: https://towardsdatascience.com/deploy-models-and-create-custom-handlers-in-torchserve-fc2d048fbe91
- Why are you using a different docker image than the officially available one? I created the TorchServe docker image from source according to https://github.com/pytorch/serve/blob/master/docker/README.md#create-torchserve-docker-image-from-source
- Is there any other process running on the machine which might be using the GPU? There is no other process running on the machine.
Thanks for your reply, and please excuse my poor English.
@Mister5ive: Thank you for the script. I will run the load test with the same input and let you know.
@Mister5ive: Could you also share the docker run command used for starting the container.
This is my docker run command used for starting the container:
docker run -itd --runtime=nvidia -p 8080:8080 -p 8081:8081 --name ${gpu_container} -v /opt/app/${server}/models/model-store:/home/model-server/model-store torchserve:gpu-10.2-cudnn7-runtime-ubuntu18.04 torchserve --start --model-store model-store --ts-config /home/model-server/model-store/ts.properties --models ${model_name}
@Mister5ive: I executed 100000 requests against your model running inside the docker container created from the image you shared, and all of the requests succeeded.
Question: Is there any memory restriction imposed on docker containers in your production environment?
Also, the nvidia-smi output you shared shows the python process consuming ~5 GB of memory, whereas in my testing both processes consume only ~600 MB.
1. I set up TorchServe batch inference via model_snapshot in config.properties, so this may be why the python process consumes ~5 GB of memory. As follows: model_snapshot={"name":"startup.cfg","modelCount": 1,"models":{"SpamCls":{"1.0":{"defaultVersion": true, "marName": "SpamCls.mar", "minWorkers": 2, "maxWorkers": 2,"batchSize": 30, "maxBatchDelay": 50, "responseTimeout": 120}}}}
2. In my production environment, 80 workers (python flask) send requests to TorchServe constantly and concurrently.
The following gif shows the restart problem on one torchserve worker:
Thanks for your reply.
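For context, the constant concurrent traffic described in point 2 could be approximated with a small Python client along these lines (a rough sketch, not the actual flask workers: the thread count of 80 mirrors the description above, while the image path, request count, and endpoint/payload shape are assumptions taken from the earlier curl example):

import base64
import json
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/predictions/SpamCls"  # endpoint from the curl example

def send_one(body: bytes) -> int:
    # Returns the HTTP status code, including error statuses such as 503.
    req = urllib.request.Request(URL, data=body, headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except urllib.error.URLError:
        return -1  # connection-level failure (e.g. connection reset)

def main() -> None:
    # Base64-encode one sample image and reuse it for every request (placeholder path).
    with open("timg.jpeg", "rb") as f:
        body = json.dumps({"base64": base64.b64encode(f.read()).decode("utf-8")}).encode("utf-8")

    # 80 threads sending requests continuously, mimicking 80 concurrent clients.
    with ThreadPoolExecutor(max_workers=80) as pool:
        statuses = list(pool.map(lambda _: send_one(body), range(10000)))
    print("non-200 responses:", sum(s != 200 for s in statuses))

if __name__ == "__main__":
    main()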
I set torchserve batch inference via model_snapshot in config.properties, so this may be why the python process consumes ~5 GB of memory.
Restoring from model_snapshot should not create a problem. I used the following command for model registration, which matches the parameters in your model_snapshot:
curl -X POST "http://localhost:8081/models?url=SpamCls.mar&initial_workers=2&batch_size=30&max_batch_delay=50"
In my production environment, 80 workers (python flask) send requests to TorchServe constantly and concurrently.
I used this simple shell script to run a load test:
#!/bin/bash
for i in {1..10000};
do
(echo -n '{"base64": "'; base64 /home/ubuntu/timg.jpeg; echo '"}') | curl -H "Content-Type: application/json" -d @- http://localhost:8080/predictions/SpamCls &
if (( $i % 100 == 0 ))
then
sleep 5;
fi
done;
Could you share your load test script to reproduce this issue?
Sorry for the late reply.
I reproduced this issue in a test environment using locust; the attached code is my test script. In the script, kk is the list of URLs of 10000 test images, covering GIF, JPG and PNG formats. The pictures come from the company, so it is not convenient to share them. Run the test script with "locust -f test.py".
test.py.zip As shown in the picture, QPS goes up to 100+.
The gif shows this issue:
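For readers who cannot open test.py.zip, a locust script of this shape might look roughly like the sketch below (assumptions: the locust >= 1.0 HttpUser API, the {"base64": ...} payload from the earlier curl example, and a single placeholder image instead of the company's list of 10000 image URLs in kk):

import base64

from locust import HttpUser, between, task

# Base64-encode one placeholder image at import time; the real script cycles
# through 10000 image URLs (GIF, JPG and PNG) instead.
with open("timg.jpeg", "rb") as f:
    IMAGE_B64 = base64.b64encode(f.read()).decode("utf-8")

class SpamClsUser(HttpUser):
    # Short wait times keep the request rate high (100+ QPS with enough users).
    wait_time = between(0.01, 0.1)

    @task
    def predict(self):
        self.client.post("/predictions/SpamCls", json={"base64": IMAGE_B64})

# Run with: locust -f test.py --host http://localhost:8080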
I am getting the same error while load testing using k6. There were no failed requests with 100 users; on increasing to 1000 users, HTTP requests fail with a 503 error. Below are the TS logs.
2021-10-19 16:13:12,395 [ERROR] epollEventLoopGroup-3-7 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,396 [ERROR] epollEventLoopGroup-3-6 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,393 [ERROR] epollEventLoopGroup-3-5 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,397 [ERROR] epollEventLoopGroup-3-7 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,397 [ERROR] epollEventLoopGroup-3-7 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,392 [ERROR] epollEventLoopGroup-3-2 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,396 [INFO ] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 2
2021-10-19 16:13:12,392 [ERROR] epollEventLoopGroup-3-8 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,400 [ERROR] epollEventLoopGroup-3-7 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,397 [ERROR] epollEventLoopGroup-3-6 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,400 [ERROR] epollEventLoopGroup-3-2 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,400 [ERROR] epollEventLoopGroup-3-6 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,400 [INFO ] W-9000-mnist_1.0 ACCESS_LOG - /127.0.0.1:59690 "POST /v1/models/mnist:predict HTTP/1.1" 200 353
2021-10-19 16:13:12,400 [INFO ] W-9000-mnist_1.0 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:ip-172-31-27-189,timestamp:null
Context
Your Environment
Installed using source? [yes/no]: no
Are you planning to deploy it using docker container? [yes/no]: yes
Is it a CPU or GPU environment?: GPU
Using a default/custom handler? [If possible upload/share custom handler/model]: custom model: https://drive.google.com/file/d/1kHwvoyZYPgro4FLvlWDYjUqSqjE0MiQp/view?usp=sharing
What kind of model is it e.g. vision, text, audio?: vision
Are you planning to use local models from model-store or public url being used e.g. from S3 bucket etc.? [If public url then provide link.]:
Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs:
config.properties:
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
NUM_WORKERS=1
number_of_gpu=1
number_of_netty_threads=32
job_queue_size=1000
model_snapshot={"name":"startup.cfg","modelCount": 1,"models":{"SpamCls":{"1.0":{"defaultVersion": true, "marName": "SpamCls.mar", "minWorkers": 2, "maxWorkers": 2,"batchSize": 30, "maxBatchDelay": 50, "responseTimeout": 120}}}}
logs: https://drive.google.com/file/d/1tWLgLzunMos8LwtT_BkNh85npD6c7DNp/view?usp=sharing
Link to your project [if any]:
Expected Behavior
Current Behavior
One of the workers restarts at intervals.
Possible Solution
Steps to Reproduce
...
Failure Logs [if any]
2020-10-28 04:47:31,640 [INFO ] W-9000-SpamCls_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 41
2020-10-28 04:47:31,640 [DEBUG] W-9000-SpamCls_1.0 org.pytorch.serve.wlm.Job - Waiting time ns: 49196663, Backend time ns: 41236071
2020-10-28 04:47:31,640 [DEBUG] W-9000-SpamCls_1.0 org.pytorch.serve.wlm.Job - Waiting time ns: 40122864, Backend time ns: 41284987
2020-10-28 04:47:31,805 [ERROR] epollEventLoopGroup-3-9 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2020-10-28 04:47:32,226 [ERROR] epollEventLoopGroup-3-10 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2020-10-28 04:47:32,730 [ERROR] epollEventLoopGroup-3-11 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2020-10-28 04:47:32,874 [ERROR] epollEventLoopGroup-3-12 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2020-10-28 04:47:33,014 [ERROR] epollEventLoopGroup-3-13 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2020-10-28 04:47:33,094 [ERROR] epollEventLoopGroup-3-14 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2020-10-28 04:47:33,104 [ERROR] epollEventLoopGroup-3-15 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer