pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

worker restart and logs “epollEventLoopGroup-3-10 org.pytorch.serve.http.HttpRequestHandler” #756

Open TalkUHulk opened 4 years ago

TalkUHulk commented 4 years ago

Your issue may already be reported! Please search on the issue tracker before creating one.

Context

Your Environment

Expected Behavior

Current Behavior

One of the workers restarts at intervals. [screenshot attached]

Possible Solution

Steps to Reproduce

...

Failure Logs [if any]

2020-10-28 04:47:31,640 [INFO ] W-9000-SpamCls_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 41
2020-10-28 04:47:31,640 [DEBUG] W-9000-SpamCls_1.0 org.pytorch.serve.wlm.Job - Waiting time ns: 49196663, Backend time ns: 41236071
2020-10-28 04:47:31,640 [DEBUG] W-9000-SpamCls_1.0 org.pytorch.serve.wlm.Job - Waiting time ns: 40122864, Backend time ns: 41284987
2020-10-28 04:47:31,805 [ERROR] epollEventLoopGroup-3-9 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2020-10-28 04:47:32,226 [ERROR] epollEventLoopGroup-3-10 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2020-10-28 04:47:32,730 [ERROR] epollEventLoopGroup-3-11 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2020-10-28 04:47:32,874 [ERROR] epollEventLoopGroup-3-12 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2020-10-28 04:47:33,014 [ERROR] epollEventLoopGroup-3-13 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2020-10-28 04:47:33,094 [ERROR] epollEventLoopGroup-3-14 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2020-10-28 04:47:33,104 [ERROR] epollEventLoopGroup-3-15 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

harshbafna commented 4 years ago

@Mister5ive: Please update the context details. Also, attach the ts_log.log and access_log.log files from the logs directory.

Questions:

  • When is the error observed? While scaling up the model workers or at the time of inference?

  • Does the model work fine with a single worker?

  • Could you share a minimal reproducer?

TalkUHulk commented 4 years ago

@Mister5ive: Please update the context details. Also, attach the ts_log.log and access_log.log files from the logs directory.

Questions:

  • When is the error observed? While scaling up the model workers or at the time of inference? This happened in the production environment.

  • Does the model work fine with a single worker? Not really; the issue occurs even with a single worker.

  • Could you share a minimal reproducer? Docker image: docker pull mister5ive/torchserve:gpu-10.2-cudnn7-runtime-ubuntu18.04

Thanks for your reply.

harshbafna commented 4 years ago

@Mister5ive: Thanks for the updates. I will try to reproduce this error on my end and get back to you.

harshbafna commented 4 years ago

@Mister5ive: Please share a sample input and a sample inference request command for your model. I could not reproduce this with simple curl based requests.

(echo -n '{"base64": "'; base64 /home/ubuntu/serve/examples/image_classifier/kitten.jpg; echo '"}') | curl -H "Content-Type: application/json" -d @-  http://localhost:8080/predictions/SpamCls

Questions:

TalkUHulk commented 4 years ago

@Mister5ive: Please share a sample input and a sample inference request command for your model. I could not reproduce this with simple curl based requests.

(echo -n '{"base64": "'; base64 /home/ubuntu/serve/examples/image_classifier/kitten.jpg; echo '"}') | curl -H "Content-Type: application/json" -d @-  http://localhost:8080/predictions/SpamCls
  • This is my test script. I deleted the unnecessary code; you just need to change manger.py -> model_api. sample.zip

  • This problem was found in a production environment with large-scale concurrency.

Questions:

Thanks for your reply, and please excuse my poor English.

harshbafna commented 4 years ago

@Mister5ive: Thank you for the script. I will run the load test with the same input and let you know.

harshbafna commented 4 years ago

@Mister5ive : Could you also share the docker run command used for starting the container?

TalkUHulk commented 4 years ago

@Mister5ive : Could you also share the docker run command used for starting the container?

This is the docker run command used for starting the container:

docker run -itd --runtime=nvidia -p 8080:8080 -p 8081:8081 --name ${gpu_container} -v /opt/app/${server}/models/model-store:/home/model-server/model-store torchserve:gpu-10.2-cudnn7-runtime-ubuntu18.04 torchserve --start --model-store model-store --ts-config /home/model-server/model-store/ts.properties --models ${model_name}

harshbafna commented 4 years ago

@Mister5ive: I executed 100,000 requests against your model running inside the docker container created from the image you shared, and all of the requests succeeded.

Question: Is there any memory restriction imposed on docker containers in your production environment?

Also, the nvidia-smi output you shared shows the Python process consuming ~5 GB of memory; in my testing, both processes consumed only ~600 MB.

[screenshot attached]

TalkUHulk commented 4 years ago

@Mister5ive: I executed 100,000 requests against your model running inside the docker container created from the image you shared, and all of the requests succeeded.

Question: Is there any memory restriction imposed on docker containers in your production environment?

Also, the nvidia-smi output you shared shows the Python process consuming ~5 GB of memory; in my testing, both processes consumed only ~600 MB.

[screenshot attached]

1. I set up TorchServe batch inference via model_snapshot in config.properties, which may explain why the Python process consumes ~5 GB of memory. The snapshot entry is as follows:

model_snapshot={"name":"startup.cfg","modelCount": 1,"models":{"SpamCls":{"1.0":{"defaultVersion": true, "marName": "SpamCls.mar", "minWorkers": 2, "maxWorkers": 2,"batchSize": 30, "maxBatchDelay": 50, "responseTimeout": 120}}}}

2. In my production environment, I use 80 workers (Python Flask) that request TorchServe constantly and concurrently; a rough sketch of this load pattern follows below.
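For illustration only, here is a hypothetical Python sketch that approximates this load pattern with threads and the requests library. The image path, request count, and timeout are assumptions, and the real load comes from 80 separate Flask workers rather than a single process:

import base64
import concurrent.futures

import requests

URL = "http://localhost:8080/predictions/SpamCls"

# Hypothetical sample image, base64-encoded into the same JSON format used by
# the curl example earlier in this thread.
with open("sample.jpg", "rb") as f:
    PAYLOAD = {"base64": base64.b64encode(f.read()).decode("utf-8")}

def send_one(_):
    # One inference request, mimicking a single Flask worker's call.
    resp = requests.post(URL, json=PAYLOAD, timeout=120)
    return resp.status_code

# 80 concurrent "workers", each issuing requests back to back.
with concurrent.futures.ThreadPoolExecutor(max_workers=80) as pool:
    codes = list(pool.map(send_one, range(10000)))

print(sum(1 for c in codes if c != 200), "non-200 responses out of", len(codes))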

The following GIF shows one TorchServe worker restarting: [GIF attached: Jietu20201029-213901-HD]

Thanks for your reply.

harshbafna commented 4 years ago

I set up TorchServe batch inference via model_snapshot in config.properties, which may explain why the Python process consumes ~5 GB of memory.

Restoring from model_snapshot should not create a problem. I used the following command for model registration, which matches the parameters in your model_snapshot:

curl -X POST "http://localhost:8081/models?url=SpamCls.mar&initial_workers=2&batch_size=30&max_batch_delay=50"

In my production environment, I use 80 workers (Python Flask) that request TorchServe constantly and concurrently.

I used this simple shell script to run a load test:

#!/bin/bash
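# Send 10000 inference requests in the background, pausing for 5 seconds
# after every 100 requests.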

for i in {1..10000};
do
    (echo -n '{"base64": "'; base64 /home/ubuntu/timg.jpeg; echo '"}') | curl -H "Content-Type: application/json" -d @-  http://localhost:8080/predictions/SpamCls &
    if (( $i % 100 == 0 ))
    then
        sleep 5;
    fi
done;

Could you share your load test script to reproduce this issue?

TalkUHulk commented 4 years ago

I set up TorchServe batch inference via model_snapshot in config.properties, which may explain why the Python process consumes ~5 GB of memory.

Restoring from model_snapshot should not create a problem. I used the following command for model registration, which matches the parameters in your model_snapshot:

curl -X POST "http://localhost:8081/models?url=SpamCls.mar&initial_workers=2&batch_size=30&max_batch_delay=50"

In my production environment, I use 80 workers (Python Flask) that request TorchServe constantly and concurrently.

I used this simple shell script to run a load test:

#!/bin/bash

for i in {1..10000};
do
  (echo -n '{"base64": "'; base64 /home/ubuntu/timg.jpeg; echo '"}') | curl -H "Content-Type: application/json" -d @-  http://localhost:8080/predictions/SpamCls &
  if (( $i % 100 == 0 ))
  then
      sleep 5;
  fi
done;

Could you share your load test script to reproduce this issue?

Sorry for the late reply.

I reproduced this issue in a test environment using Locust; the attached code is my test script. In the script, kk holds the URLs of 10,000 test images in GIF, JPG, and PNG formats. The pictures belong to the company, so it is not convenient to share them. Run the test script with locust -f test.py. A minimal sketch of the approach is shown below.
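For reference, a minimal, hypothetical Locust script along these lines might look like the sketch below. It assumes a recent Locust version, a local sample image file, and the same {"base64": ...} JSON payload as the curl example earlier in this thread; the actual script is in test.py.zip and may differ:

import base64

from locust import HttpUser, task, between

# Hypothetical payload: one local sample image, base64-encoded into the same
# JSON format used by the curl example above.
with open("sample.jpg", "rb") as f:
    PAYLOAD = {"base64": base64.b64encode(f.read()).decode("utf-8")}

class SpamClsUser(HttpUser):
    wait_time = between(0.01, 0.1)  # keep requests nearly continuous

    @task
    def predict(self):
        # POST the base64-encoded image to the inference endpoint.
        self.client.post("/predictions/SpamCls", json=PAYLOAD)

Run it with something like locust -f test.py --host http://localhost:8080.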

test.py.zip

As shown in the attached screenshots, QPS reaches 100+.

The following GIF shows this issue:

[GIF attached: Jietu20201031-154705-HD]

jagadeeshi2i commented 3 years ago

I am getting the same error while load testing with k6. There were no failed requests with 100 users; after increasing to 1000 users, HTTP requests fail with a 503 error. Below are the TS logs.

2021-10-19 16:13:12,395 [ERROR] epollEventLoopGroup-3-7 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,396 [ERROR] epollEventLoopGroup-3-6 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,393 [ERROR] epollEventLoopGroup-3-5 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,397 [ERROR] epollEventLoopGroup-3-7 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,397 [ERROR] epollEventLoopGroup-3-7 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,392 [ERROR] epollEventLoopGroup-3-2 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,396 [INFO ] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 2
2021-10-19 16:13:12,392 [ERROR] epollEventLoopGroup-3-8 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,400 [ERROR] epollEventLoopGroup-3-7 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,397 [ERROR] epollEventLoopGroup-3-6 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,400 [ERROR] epollEventLoopGroup-3-2 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,400 [ERROR] epollEventLoopGroup-3-6 org.pytorch.serve.http.HttpRequestHandler - 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-10-19 16:13:12,400 [INFO ] W-9000-mnist_1.0 ACCESS_LOG - /127.0.0.1:59690 "POST /v1/models/mnist:predict HTTP/1.1" 200 353
2021-10-19 16:13:12,400 [INFO ] W-9000-mnist_1.0 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:ip-172-31-27-189,timestamp:null