antedesk opened this issue 1 year ago (status: Open)
@antedesk the error "io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer" usually means the client connection timed out. It does not indicate that TorchServe is unhealthy.
TorchServe provides a plugin mechanism that allows users to plug in customized endpoints. The "ping" endpoint is an example, used by AWS SageMaker.
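The mechanism behind that error is easy to reproduce in isolation: when a peer aborts the connection before the server writes its response (e.g. a probe with a short timeout), the server's next write fails with the same "connection reset by peer" errno that Netty wraps. A minimal, self-contained sketch, illustrative only and not TorchServe code:

```python
import socket
import struct
import time

# Illustrative only: reproduce "connection reset by peer" by having a
# client abort the connection before the server writes its response.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()

# SO_LINGER with a zero timeout makes close() send an RST (abortive
# close), mimicking a probe that gives up instead of reading the reply.
cli.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
cli.close()
time.sleep(0.2)  # let the RST reach the server side

try:
    conn.sendall(b"response the client never waited for")
    outcome = "write succeeded"
except (ConnectionResetError, BrokenPipeError):
    # Linux surfaces ECONNRESET here; some platforms surface EPIPE.
    outcome = "connection reset"
finally:
    conn.close()
    srv.close()

print(outcome)
```

The write itself fails, not the health check, which is why the error is non-fatal for the worker.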
🐛 Describe the bug
We are using TorchServe to serve a `yolox_x` model trained with mmdet. We created a customized TorchServe Docker image and wrote a simple `docker-compose.yml` file which runs on a Debian 11 host with:

- Docker version 23.0.1, build a5ee5b1
- Docker Compose version v2.15.1
In our current deployment, we are using an HAProxy-based load balancer that communicates with the TorchServe hosts. HAProxy checks every second that each TorchServe host is up and running by calling `GET $HOSTNAME:8080/ping`; if the response has status code `200` and the response body contains the word `Healthy`, everything is OK.

Unfortunately, looking at the TorchServe logs with `docker logs -f torchserve` (where `torchserve` is the container name), we noticed a set of errors each time HAProxy checks whether the TorchServe host is up. Here is an example of the errors.
Here, `LOADBLANCER_IP` is the anonymized IP of the load balancer host. When we stop the HAProxy service, the TorchServe instance stops logging the error. It looks like a non-blocking error, but it is not healthy behavior.
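For reference, the check can be exercised outside HAProxy with a small stdlib probe. The host/port arguments below are placeholders, and the success criterion mirrors the one described above (HTTP 200 and a body containing the word "Healthy"):

```python
from urllib.error import URLError
from urllib.request import urlopen

def torchserve_is_healthy(host: str, port: int = 8080, timeout: float = 2.0) -> bool:
    """Mirror the HAProxy criterion: GET /ping must return HTTP 200
    and the response body must contain the word 'Healthy'."""
    try:
        with urlopen(f"http://{host}:{port}/ping", timeout=timeout) as resp:
            return resp.status == 200 and b"Healthy" in resp.read()
    except (URLError, OSError):
        return False
```

Unlike HAProxy's probe, this reads the full response before closing, so it does not trigger the reset on the TorchServe side.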
We also observe the same issue with the default `torchserve` Docker image and another custom model. A similar issue was opened in 2021; however, closing/avoiding the health check is not a valid option in our scenario.
Error logs
Where `LOADBLANCER_IP` is the anonymized IP of the load balancer host.

Installation instructions
I am using Docker with docker-compose and a custom image for running mmdetection object detection models (according to the official docs).
In the following is the code I defined for `entrypoint.sh` and `config.properties`.
The Dockerfile is defined as follows.
The config.properties is available here
The entrypoint.sh is available here
To build the image:
docker build --pull -t mmdet-torchserve-cpu:2.28.1 .
Finally, the `docker-compose.yml` is defined as follows.
Model Packaging
The defined handler.py is based on the one proposed by OpenMMLab. We just changed the input/output and added some basic error propagation.
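For context, the error propagation we added is along these lines. This is a simplified sketch, not our actual handler; `BaseHandler` below is a local stand-in for TorchServe's `ts.torch_handler.base_handler.BaseHandler` so the snippet is runnable on its own:

```python
# Simplified sketch; BaseHandler is a local stand-in for
# ts.torch_handler.base_handler.BaseHandler so the snippet is runnable.
class BaseHandler:
    def postprocess(self, data):
        return data

class MMDetHandler(BaseHandler):
    def postprocess(self, data):
        # Wrap each detection result; surface failures in the response
        # body instead of letting the request die with a bare 500.
        try:
            return [{"predictions": d} for d in data]
        except TypeError as exc:
            return [{"error": str(exc)}]
```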
To package the model as `.mar`, please refer to the following mmdet2torchserve.py file and the official mmdet docs.

config.properties
The config.properties is defined for the specific model and is located at `/home/model-server/model-store/config.properties` in the container. We don't override the default config.properties file defined in `/home/model-server/config.properties`.

Versions
The `python serve/ts_scripts/print_env_info.py` script doesn't work in the Docker container.

The torch* libs are
The installed pip libs are the following ones.
Repro instructions
- `GET $HOSTNAME:8080/ping`, where `$HOSTNAME` is the host where TorchServe is running under docker-compose
- `docker-compose up -d --build` to run the docker-compose file
- `docker logs -f torchserve` to see the error

Possible Solution
No response