pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

worker restart and logs “epollEventLoopGroup-3-10 org.pytorch.serve.http.HttpRequestHandler” #756

Open TalkUHulk opened 4 years ago

TalkUHulk commented 4 years ago

Your issue may already be reported! Please search on the issue tracker before creating one.

Context

Your Environment

Expected Behavior

Current Behavior

One of the workers restarts at intervals. [screenshot attached]

Possible Solution

Steps to Reproduce

...

Failure Logs [if any]

2020-10-28 04:47:31,640 [INFO ] W-9000-SpamCls_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 41
2020-10-28 04:47:31,640 [DEBUG] W-9000-SpamCls_1.0 org.pytorch.serve.wlm.Job - Waiting time ns: 49196663, Backend time ns: 41236071
2020-10-28 04:47:31,640 [DEBUG] W-9000-SpamCls_1.0 org.pytorch.serve.wlm.Job - Waiting time ns: 40122864, Backend time ns: 41284987
2020-10-28 04:47:31,805 [ERROR] epollEventLoopGroup-3-9 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2020-10-28 04:47:32,226 [ERROR] epollEventLoopGroup-3-10 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2020-10-28 04:47:32,730 [ERROR] epollEventLoopGroup-3-11 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2020-10-28 04:47:32,874 [ERROR] epollEventLoopGroup-3-12 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2020-10-28 04:47:33,014 [ERROR] epollEventLoopGroup-3-13 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2020-10-28 04:47:33,094 [ERROR] epollEventLoopGroup-3-14 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2020-10-28 04:47:33,104 [ERROR] epollEventLoopGroup-3-15 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

harshbafna commented 4 years ago

@Mister5ive: Please update the context details. Also, attach the ts_log.log and access_log.log files from the logs directory.

Questions:

  • When is the error observed? While scaling up the model workers or at the time of inference?

  • Does the model work fine with a single worker?

  • Could you share a minimal reproducer?

TalkUHulk commented 4 years ago

@Mister5ive: Please update the context details. Also, attach the ts_log.log and access_log.log files from the logs directory.

Questions:

  • When is the error observed? While scaling up the model workers or at the time of inference? This happened in the production environment.

  • Does the model work fine with a single worker? Not really; the issue occurs even with a single worker.

  • Could you share a minimal reproducer? Docker image: docker pull mister5ive/torchserve:gpu-10.2-cudnn7-runtime-ubuntu18.04

Thanks for your reply.

harshbafna commented 4 years ago

@Mister5ive: Thanks for the updates. I will try to reproduce this error on my end and get back to you.

harshbafna commented 4 years ago

@Mister5ive: Please share a sample input and a sample inference request command for your model. I could not reproduce this with simple curl based requests.

(echo -n '{"base64": "'; base64 /home/ubuntu/serve/examples/image_classifier/kitten.jpg; echo '"}') | curl -H "Content-Type: application/json" -d @-  http://localhost:8080/predictions/SpamCls

Questions:

TalkUHulk commented 4 years ago

@Mister5ive: Please share a sample input and a sample inference request command for your model. I could not reproduce this with simple curl based requests.

(echo -n '{"base64": "'; base64 /home/ubuntu/serve/examples/image_classifier/kitten.jpg; echo '"}') | curl -H "Content-Type: application/json" -d @-  http://localhost:8080/predictions/SpamCls
  • This is my test script. I deleted the unnecessary code; you just need to change manger.py -> model_api. sample.zip

  • This problem was found in a production environment with large-scale concurrency.

Questions:

Thanks for your reply, and please excuse my poor English.

harshbafna commented 4 years ago

@Mister5ive: Thank you for the script. I will run the load test with the same input and let you know.

harshbafna commented 4 years ago

@Mister5ive : Could you also share the docker run command used for starting the container?

TalkUHulk commented 4 years ago

@Mister5ive : Could you also share the docker run command used for starting the container?

This is the docker run command used for starting the container:

docker run -itd --runtime=nvidia -p 8080:8080 -p 8081:8081 --name ${gpu_container} -v /opt/app/${server}/models/model-store:/home/model-server/model-store torchserve:gpu-10.2-cudnn7-runtime-ubuntu18.04 torchserve --start --model-store model-store --ts-config /home/model-server/model-store/ts.properties --models ${model_name}

harshbafna commented 4 years ago

@Mister5ive: I executed 100,000 requests against your model running inside the docker container created from the image you shared, and all of the requests succeeded.

Question: Is there any memory restriction imposed on docker containers in your production environment?

Also, the nvidia-smi output you shared shows the Python process consuming ~5 GB of memory; in my testing, both processes consumed only ~600 MB.

[screenshot attached]

TalkUHulk commented 4 years ago

@Mister5ive: I executed 100,000 requests against your model running inside the docker container created from the image you shared, and all of the requests succeeded.

Question: Is there any memory restriction imposed on docker containers in your production environment?

Also, the nvidia-smi output you shared shows the Python process consuming ~5 GB of memory; in my testing, both processes consumed only ~600 MB.

[screenshot attached]

1. I set up TorchServe batch inference via model_snapshot in config.properties, which may explain why the Python process consumes ~5 GB of memory. The snapshot entry is as follows:

model_snapshot={"name":"startup.cfg","modelCount": 1,"models":{"SpamCls":{"1.0":{"defaultVersion": true, "marName": "SpamCls.mar", "minWorkers": 2, "maxWorkers": 2,"batchSize": 30, "maxBatchDelay": 50, "responseTimeout": 120}}}}

2. In my production environment, I use 80 workers (Python Flask) that request TorchServe constantly and concurrently; a rough sketch of this load pattern follows below.
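For illustration only, here is a hypothetical Python sketch that approximates this load pattern with threads and the requests library. The image path, request count, and timeout are assumptions, and the real load comes from 80 separate Flask workers rather than a single process:

import base64
import concurrent.futures

import requests

URL = "http://localhost:8080/predictions/SpamCls"

# Hypothetical sample image, base64-encoded into the same JSON format used by
# the curl example earlier in this thread.
with open("sample.jpg", "rb") as f:
    PAYLOAD = {"base64": base64.b64encode(f.read()).decode("utf-8")}

def send_one(_):
    # One inference request, mimicking a single Flask worker's call.
    resp = requests.post(URL, json=PAYLOAD, timeout=120)
    return resp.status_code

# 80 concurrent "workers", each issuing requests back to back.
with concurrent.futures.ThreadPoolExecutor(max_workers=80) as pool:
    codes = list(pool.map(send_one, range(10000)))

print(sum(1 for c in codes if c != 200), "non-200 responses out of", len(codes))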

The following GIF shows one TorchServe worker restarting: [GIF attached: Jietu20201029-213901-HD]

Thanks for your reply.

harshbafna commented 4 years ago

I set up TorchServe batch inference via model_snapshot in config.properties, which may explain why the Python process consumes ~5 GB of memory.

Restoring from model_snapshot should not create a problem. I used the following command for model registration, which matches the parameters in your model_snapshot:

curl -X POST "http://localhost:8081/models?url=SpamCls.mar&initial_workers=2&batch_size=30&max_batch_delay=50"

In my production environment, I use 80 workers (Python Flask) that request TorchServe constantly and concurrently.

I used this simple shell script to run a load test:

#!/bin/bash
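# Send 10000 inference requests in the background, pausing for 5 seconds
# after every 100 requests.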

for i in {1..10000};
do
    (echo -n '{"base64": "'; base64 /home/ubuntu/timg.jpeg; echo '"}') | curl -H "Content-Type: application/json" -d @-  http://localhost:8080/predictions/SpamCls &
    if (( $i % 100 == 0 ))
    then
        sleep 5;
    fi
done;

Could you share your load test script to reproduce this issue?

TalkUHulk commented 4 years ago

I set up TorchServe batch inference via model_snapshot in config.properties, which may explain why the Python process consumes ~5 GB of memory.

Restoring from model_snapshot should not create a problem. I used the following command for model registration, which matches the parameters in your model_snapshot:

curl -X POST "http://localhost:8081/models?url=SpamCls.mar&initial_workers=2&batch_size=30&max_batch_delay=50"

In my production environment, I use 80 workers (Python Flask) that request TorchServe constantly and concurrently.

I used this simple shell script to run a load test:

#!/bin/bash

for i in {1..10000};
do
  (echo -n '{"base64": "'; base64 /home/ubuntu/timg.jpeg; echo '"}') | curl -H "Content-Type: application/json" -d @-  http://localhost:8080/predictions/SpamCls &
  if (( $i % 100 == 0 ))
  then
      sleep 5;
  fi
done;

Could you share your load test script to reproduce this issue?

Sorry for the late reply.

I reproduced this issue in a test environment using Locust; the attached code is my test script. In the script, kk holds the URLs of 10,000 test images in GIF, JPG, and PNG formats. The pictures belong to the company, so it is not convenient to share them. Run the test script with locust -f test.py. A minimal sketch of the approach is shown below.
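For reference, a minimal, hypothetical Locust script along these lines might look like the sketch below. It assumes a recent Locust version, a local sample image file, and the same {"base64": ...} JSON payload as the curl example earlier in this thread; the actual script is in test.py.zip and may differ:

import base64

from locust import HttpUser, task, between

# Hypothetical payload: one local sample image, base64-encoded into the same
# JSON format used by the curl example above.
with open("sample.jpg", "rb") as f:
    PAYLOAD = {"base64": base64.b64encode(f.read()).decode("utf-8")}

class SpamClsUser(HttpUser):
    wait_time = between(0.01, 0.1)  # keep requests nearly continuous

    @task
    def predict(self):
        # POST the base64-encoded image to the inference endpoint.
        self.client.post("/predictions/SpamCls", json=PAYLOAD)

Run it with something like locust -f test.py --host http://localhost:8080.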

test.py.zip

As shown in the attached screenshots, QPS reaches 100+.

The following GIF shows this issue:

[GIF attached: Jietu20201031-154705-HD]

jagadeeshi2i commented 3 years ago

I am getting the same error while load testing with k6. There were no failed requests with 100 users; after increasing to 1000 users, HTTP requests fail with a 503 error. Below are the TS logs.

2021-10-19 16:13:12,395 [ERROR] epollEventLoopGroup-3-7 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,396 [ERROR] epollEventLoopGroup-3-6 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,393 [ERROR] epollEventLoopGroup-3-5 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,397 [ERROR] epollEventLoopGroup-3-7 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,397 [ERROR] epollEventLoopGroup-3-7 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,392 [ERROR] epollEventLoopGroup-3-2 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,396 [INFO ] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 2
2021-10-19 16:13:12,392 [ERROR] epollEventLoopGroup-3-8 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,400 [ERROR] epollEventLoopGroup-3-7 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,397 [ERROR] epollEventLoopGroup-3-6 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,400 [ERROR] epollEventLoopGroup-3-2 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,400 [ERROR] epollEventLoopGroup-3-6 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,400 [INFO ] W-9000-mnist_1.0 ACCESS_LOG - /127.0.0.1:59690 "POST /v1/models/mnist:predict HTTP/1.1" 200 353
2021-10-19 16:13:12,400 [INFO ] W-9000-mnist_1.0 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:ip-172-31-27-189,timestamp:null