replicate / cog

Containers for machine learning
https://cog.run
Apache License 2.0

Health check endpoint unresponsive during predictions #1719

Open ggilder opened 6 months ago

ggilder commented 6 months ago

Hi, I'm trying to use cog to build images I can run in a Kubernetes cluster. I'm seeing an issue where I can't query the state of each pod (e.g. using the /health-check endpoint that cog defines) while a prediction is running.

I have a simple demonstration here: https://github.com/ggilder/cog-health-check-test

You can run the above on Docker Desktop. It's basically the cog "hello world" example with a sleep added to simulate a long-running prediction. The test script hits the prediction endpoint and then attempts to hit the health check endpoint, but the server waits for the prediction to complete before handling the health check. That doesn't seem to match the intended behavior, since the health state is explicitly defined to allow a "busy" state.
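The blocking behavior can be reproduced with plain asyncio, independent of cog or Docker. This is a minimal sketch, not cog's actual server code: the blocking `time.sleep` stands in for a synchronous predictor running on the event loop, and the latency measurement stands in for the stalled health check.

```python
import asyncio
import time

async def predict_blocking():
    # Simulates a synchronous predictor invoked from an
    # async route: time.sleep() never yields control back
    # to the event loop.
    time.sleep(0.2)

async def main():
    t0 = time.monotonic()
    pred = asyncio.create_task(predict_blocking())
    await asyncio.sleep(0)  # hand control to the prediction task
    # This models the /health-check request: it cannot be
    # served until the blocking prediction finishes.
    health_latency = time.monotonic() - t0
    await pred
    return health_latency

latency = asyncio.run(main())
print(f"health check delayed by {latency:.2f}s")
```

Swapping `time.sleep(0.2)` for `await asyncio.sleep(0.2)` drops the measured delay to nearly zero, which is why where the blocking work runs matters so much here.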

Am I missing something that would allow the server to respond to health checks during predictions?

ggilder commented 6 months ago

Perhaps the prediction endpoint should be declared with def rather than async def so that it doesn't block the main thread? (per FastAPI docs)
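As a sketch of why that would help (again plain asyncio, not cog's code): FastAPI runs plain `def` endpoints in a worker thread pool, and `asyncio.to_thread` models the same offloading, leaving the event loop free for other requests.

```python
import asyncio
import time

async def predict_in_thread():
    # FastAPI runs plain `def` endpoints in a worker thread;
    # asyncio.to_thread models that behavior here.
    await asyncio.to_thread(time.sleep, 0.2)

async def main():
    t0 = time.monotonic()
    pred = asyncio.create_task(predict_in_thread())
    await asyncio.sleep(0)  # let the prediction task start
    # The event loop stays free, so a health check would be
    # answered immediately while the prediction runs.
    health_latency = time.monotonic() - t0
    await pred
    return health_latency

latency = asyncio.run(main())
print(f"health check delayed by {latency:.3f}s")
```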

mattt commented 5 months ago

Hi @ggilder. Thank you so much for taking the time to put together that demo and contributing a possible fix with #1720.

What you're observing is a result of the asynchronous prediction handling introduced by https://github.com/replicate/cog/pull/870 interacting with a more recent change in #1355.

In your demo, adding -H "Prefer: respond-async" to the original curl command allows the healthcheck to complete. But I'm not sure how expected / obvious / useful that behavior is.
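For anyone testing this from Python instead of curl, here's a hedged sketch of the same request; the host, port, and input payload are placeholders, not taken from the demo repo:

```python
import json
import urllib.request

# Placeholder local cog server address; adjust as needed.
url = "http://localhost:5000/predictions"
body = json.dumps({"input": {"text": "hello"}}).encode()

req = urllib.request.Request(
    url,
    data=body,
    headers={
        "Content-Type": "application/json",
        # Ask the server to return immediately instead of
        # holding the connection until the prediction ends.
        "Prefer": "respond-async",
    },
    method="POST",
)
# urllib.request.urlopen(req)  # returns once the job is accepted
```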

And as @nickstenning points out in that PR:

> There's a --threads argument to cog.server.http and I believe that it functions as a concurrency control for the synchronous prediction create endpoint.

@technillogue You have more context about this change and async Python in general. Do you think we should reconsider #1355? Or is that a prerequisite for the async branch we're working to merge?

ggilder commented 5 months ago

I think there's definitely a use case here for running (synchronous) predictions on multiple VMs and checking each one's readiness for additional work. The Prefer header is interesting, but requiring the use of webhooks isn't necessarily ideal.

Thanks for pointing out #1355; as I noted in https://github.com/replicate/cog/pull/1720, reverting it would change the behavior, removing the ability to "use the network socket's accept queue as a mechanism to queue up work" that @nickstenning mentions in #1355. Whichever behavior is chosen (or even if it becomes a user-configurable option), I think it should be better documented; right now it's something you have to discover by accident while using cog. :)