tensorflow / serving

A flexible, high-performance serving system for machine learning models
https://www.tensorflow.org/serving
Apache License 2.0

Add health check to Dockerfile #2219

Open alejones opened 4 months ago

alejones commented 4 months ago

Feature Request

Add a health check to the Dockerfile

Describe the problem the feature is intended to solve

I'm using a docker container on an edge device and would like to be able to shut it down if it becomes unresponsive.

Describe the solution

In the Dockerfile, add a HEALTHCHECK instruction. I'll leave it up to the TensorFlow team to decide how it should be checked. Just checking that the container is alive would be a start; actually checking that it can return predictions would be even better.

For checking if it is alive, I use this in Jupyter notebooks. This isn't a complete solution, just an idea.

curl http://10.0.0.10:8501/v1/models/my_model
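As a rough sketch of what an "alive" check could look at beyond the HTTP status code: the model status endpoint returns JSON with a `model_version_status` list, and a check could require at least one version in the `AVAILABLE` state. The URL, model name, timeout, and the function names `model_is_available` and `check_alive` below are placeholders of mine, and this assumes a Python 3 interpreter is available wherever the check runs, which the stock image may not provide.

```python
import json
import urllib.request

# Assumed endpoint: host, port and model name are placeholders for illustration.
STATUS_URL = "http://localhost:8501/v1/models/my_model"


def model_is_available(status_body: str) -> bool:
    """Return True if any model version in the status response is AVAILABLE."""
    try:
        payload = json.loads(status_body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    versions = payload.get("model_version_status", [])
    if not isinstance(versions, list):
        return False
    return any(
        isinstance(v, dict) and v.get("state") == "AVAILABLE" for v in versions
    )


def check_alive(url: str = STATUS_URL, timeout: float = 3.0) -> int:
    """Return 0 when healthy and 1 otherwise, the exit codes HEALTHCHECK expects."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers URLError, HTTPError, timeouts and refused connections.
        return 1
    return 0 if model_is_available(body) else 1
```

Ending the script with `raise SystemExit(check_alive())` would turn it into a command that a HEALTHCHECK (or Podman's health check) could call.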

For checking if predictions can be returned, I would be happy with a prediction from the half plus two model that is used when testing the dockerfile.
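A prediction check along these lines might look like the following sketch. It assumes the model is served under the name `half_plus_two` on the REST port 8501 and that it computes x / 2 + 2; the URL, the sample inputs, the tolerance, and the helper names are my own placeholders, and it needs Python 3 wherever the check runs, which the stock image may not include.

```python
import json
import math
import urllib.request

# Assumed endpoint and inputs: adjust the host, port and model name as needed.
PREDICT_URL = "http://localhost:8501/v1/models/half_plus_two:predict"
INSTANCES = [1.0, 2.0, 5.0]


def expected_half_plus_two(instances):
    """Values the half plus two model should return: x / 2 + 2 for each input."""
    return [x / 2.0 + 2.0 for x in instances]


def predictions_match(response_body: str, expected, tol: float = 1e-6) -> bool:
    """Return True if the response holds predictions equal to `expected`."""
    try:
        payload = json.loads(response_body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    predictions = payload.get("predictions")
    if not isinstance(predictions, list) or len(predictions) != len(expected):
        return False
    return all(
        isinstance(p, (int, float)) and math.isclose(p, e, abs_tol=tol)
        for p, e in zip(predictions, expected)
    )


def check_predict(url: str = PREDICT_URL, timeout: float = 3.0) -> int:
    """Return 0 if the server answers with the expected predictions, else 1."""
    request = urllib.request.Request(
        url,
        data=json.dumps({"instances": INSTANCES}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers URLError, HTTPError, timeouts and refused connections.
        return 1
    return 0 if predictions_match(body, expected_half_plus_two(INSTANCES)) else 1
```

Comparing with a tolerance rather than exact equality is a deliberate choice, since the served model returns floating-point values.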

This is an excerpt from the linked Docker documentation on how to add health checks.

HEALTHCHECK --interval=5m --timeout=3s \
  CMD curl -f http://localhost/ || exit 1

Describe alternatives you've considered

None yet. Open to suggestions on how to restart a container started with systemd.

Additional context

I'm using rootless Podman and Ubuntu 24.04 on an edge device. I'd like to use the health checking built into Podman to be able to kill the container, and then have systemd bring up a fresh container. I have no problems starting and using TensorFlow Serving with Podman, but monitoring does not currently work for me.

alejones commented 4 months ago

Any updates on this?

alejones commented 3 months ago

Checking again if there are any updates.

YanghuaHuang commented 3 months ago

Sorry about the late reply. I personally haven't had the chance or the bandwidth to look into this. Unassigning myself, as I am not actively working on it.

alejones commented 3 months ago

Thanks for the update, just want to make sure it doesn't get lost.