openvinotoolkit / model_server

A scalable inference server for models optimized with OpenVINO™
https://docs.openvino.ai/2024/ovms_what_is_openvino_model_server.html
Apache License 2.0

[Bug]: Connection reset by peer code 14. #2243

Open dasantosa opened 9 months ago

dasantosa commented 9 months ago

OpenVINO Version

2023.0

Operating System

Ubuntu 20.04 (LTS)

Device used for inference

CPU

Framework

PyTorch

Model used

No response

Issue description

I'm using gRPC to make requests to the Predict service. When I run it on a local machine I have no problems, but when I deploy it on AWS, I sometimes get a "Connection reset by peer" error. It doesn't follow a sequence, that is, it happens randomly and I need to reopen the channel. I am using the Python API, creating the connection and calling the Predict endpoint like this:

import grpc
# Proto stubs for the TF Serving-compatible API (import path assumed)
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Open an insecure gRPC channel to the model server and create the stub
self.channel = grpc.insecure_channel(f"{self.host}:{self.port}", options=self.options)
stub = prediction_service_pb2_grpc.PredictionServiceStub(self.channel)

# Build the request and call Predict with a 30 s timeout
request = predict_pb2.PredictRequest()
request.model_spec.name = self.model_name
request.model_spec.signature_name = self.signature_name
result = stub.Predict(request, 30, wait_for_ready=True)
return result

I have the following configuration:

grpc.max_message_length = 100 * 1024 * 1024
grpc.max_receive_message_length = 128 * 1024 * 1024
grpc.enable_http_proxy = 0
grpc.keepalive_time_ms = 2147483647
grpc.max_connection_idle_ms = 2147483647
grpc.max_connection_age_ms = 2147483647
grpc.max_connection_age_grace_ms = 2147483647
grpc.client_idle_timeout_ms = 2147483647
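
For reference, a minimal sketch of how these arguments would typically be assembled into the options list passed to grpc.insecure_channel (values copied from the list above; the address is a placeholder, and the max_connection_* arguments are normally server-side channel options, so they may have no effect on a client channel):

import grpc

# Channel arguments as (name, value) tuples, the form grpc.insecure_channel expects
options = [
    ("grpc.max_message_length", 100 * 1024 * 1024),
    ("grpc.max_receive_message_length", 128 * 1024 * 1024),
    ("grpc.enable_http_proxy", 0),
    ("grpc.keepalive_time_ms", 2147483647),
    ("grpc.max_connection_idle_ms", 2147483647),
    ("grpc.max_connection_age_ms", 2147483647),
    ("grpc.max_connection_age_grace_ms", 2147483647),
    ("grpc.client_idle_timeout_ms", 2147483647),
]
# Placeholder address; the issue's deployment listens on port 9000
channel = grpc.insecure_channel("localhost:9000", options=options)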

Step-by-step reproduction

No response

Relevant log output

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Connection reset by peer"
    debug_error_string = "{"created":"@1699622642.536271816","description":"Error received from peer ipv4:x.x.x.x:9000","file":"src/core/lib/surface/call.cc","file_line":1061,"grpc_message":"Connection reset by peer","grpc_status":14}"

Issue submission checklist

mlukasze commented 9 months ago

Isn't it a problem of OVMS? @dtrawins FYI

atobiszei commented 8 months ago

@dasantosa Could you share OVMS logs with log_level DEBUG? Did you try using the OVMS client as an alternative?
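
For reference, a minimal ovmsclient sketch (the address, input name, shape, and data below are placeholders, not taken from this issue):

import numpy as np
from ovmsclient import make_grpc_client

# Placeholder address and input; adjust to the actual deployment and model
client = make_grpc_client("localhost:9000")
inputs = {"input": np.zeros((1, 3, 224, 224), dtype=np.float32)}
result = client.predict(inputs=inputs, model_name="my_model")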

dasantosa commented 8 months ago

@atobiszei thanks for your response! I tried the OVMS client and the error occurs in the same way. That was the reason I implemented my own version, which is essentially the OVMS client with some additional gRPC features added. On the server side, with log_level DEBUG, I didn't get any output when the error occurred, so I can't attach information about it...

However, here is some additional code I added. I try to check the channel status before sending the request, but it doesn't work as I expected:

# Poll the underlying channel's connectivity state
# (0 = IDLE, 1 = CONNECTING, 2 = READY, 3 = TRANSIENT_FAILURE, 4 = SHUTDOWN)
state = self.channel._channel.check_connectivity_state(True)

# Reopen the channel if it is neither idle nor ready
if state != 0 and state != 2:
    self.channel.close()
    self.reinitchannel_and_checkserverconectivity()
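
A public-API alternative to polling the internal channel state is grpc.channel_ready_future; a minimal sketch, assuming the same reinit helper as above:

import grpc

# Block until the channel is READY, or close and reopen it after a 5 second timeout
try:
    grpc.channel_ready_future(self.channel).result(timeout=5)
except grpc.FutureTimeoutError:
    self.channel.close()
    self.reinitchannel_and_checkserverconectivity()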

And the libraries that I use with their versions:

grpcio==1.59.3
grpclib==0.4.6
protobuf==3.19.0
requests~=2.31.0
numpy==1.19.5

mzegla commented 8 months ago

I don't think it's something to fix on the client side. My guess would be networking, especially since you say it always works when you deploy locally and the issue appears only on AWS.

It doesn't follow a sequence, that is, it happens randomly and I need to reopen the channel.

When you deploy to AWS, is it always okay at the beginning - for the first few requests - and then stops working, or is it completely random? When you encounter that error, do you do something on the deployment side (on AWS) or just reconnect the client?

dasantosa commented 8 months ago

Exactly. When I deploy it on AWS it works fine; I tested it by making requests with the same image for a few hours, with a random delay of between 1 and 5 minutes. Sometimes it fails and I have to handle the 503 exception. When the exception occurs, I just close the channel and reopen it, and it works fine again until the next exception.
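
A minimal sketch of that close-and-reopen handling, assuming the client attributes and reinit helper from the snippets above (the retry wrapper itself is hypothetical, and it assumes the reinit helper replaces client.channel):

import grpc
from tensorflow_serving.apis import prediction_service_pb2_grpc

def predict_with_retry(client, request, retries=2):
    # Retry only on UNAVAILABLE (e.g. "Connection reset by peer"),
    # closing and reopening the channel before the next attempt
    for attempt in range(retries + 1):
        stub = prediction_service_pb2_grpc.PredictionServiceStub(client.channel)
        try:
            return stub.Predict(request, 30, wait_for_ready=True)
        except grpc.RpcError as err:
            if err.code() != grpc.StatusCode.UNAVAILABLE or attempt == retries:
                raise
            client.channel.close()
            client.reinitchannel_and_checkserverconectivity()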