ggilder opened 6 months ago
Perhaps the prediction endpoint should be declared with `def` rather than `async def`, so that it doesn't block the main thread? (per FastAPI docs)
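For context, the FastAPI docs note that path operations declared with plain `def` run in an external threadpool, while `async def` handlers run directly on the event loop and will block it if they do synchronous work. A minimal sketch of the difference (the routes and sleep below are illustrative, not cog's actual handlers):

```python
import time

from fastapi import FastAPI

app = FastAPI()


@app.post("/predictions-sync")
def predict_sync():
    # Plain def: FastAPI runs this in a threadpool, so the event loop stays
    # free to answer other requests (e.g. a health check) during the sleep.
    time.sleep(30)
    return {"status": "succeeded"}


@app.post("/predictions-async")
async def predict_async():
    # async def: this runs on the event loop itself, so a blocking call like
    # time.sleep stalls every other request until it finishes.
    time.sleep(30)
    return {"status": "succeeded"}


@app.get("/health-check")
def health_check():
    return {"status": "ok"}
```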
Hi @ggilder. Thank you so much for taking the time to put together that demo and contributing a possible fix with #1720.
What you're observing is a result of the asynchronous prediction handling introduced by https://github.com/replicate/cog/pull/870 interacting with a more recent change in #1355.
In your demo, adding `-H "Prefer: respond-async"` to the original `curl` command allows the healthcheck to complete. But I'm not sure how expected / obvious / useful that behavior is.
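For illustration, the equivalent request from Python looks roughly like this (the port and input payload are assumptions, not the demo's exact values):

```python
# Sketch of a prediction request with the async preference header added.
import requests

resp = requests.post(
    "http://localhost:5000/predictions",  # assumed default cog port
    headers={"Prefer": "respond-async"},
    json={"input": {"text": "world"}},    # assumed payload; the demo's may differ
)
# With the header, the server should respond right away (e.g. 202 Accepted)
# instead of blocking until the prediction finishes, so /health-check stays
# reachable in the meantime.
print(resp.status_code, resp.json())
```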
And as @nickstenning points out in that PR:
> There's a `--threads` argument to `cog.server.http`, and I believe that it functions as a concurrency control for the synchronous prediction create endpoint.
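As an aside, the way a bounded thread count caps concurrent synchronous work can be pictured with a plain worker pool. This is only an analogy for the `--threads` behavior described above, not cog's implementation:

```python
# Analogy only: a bounded worker pool acting as a concurrency control for
# synchronous requests, similar in spirit to a --threads setting.
import time
from concurrent.futures import ThreadPoolExecutor


def predict(i: int) -> int:
    time.sleep(1)  # stand-in for a long-running synchronous prediction
    return i


start = time.monotonic()
with ThreadPoolExecutor(max_workers=2) as pool:  # analogous to --threads 2
    results = list(pool.map(predict, range(4)))
# With 4 jobs and 2 workers this takes ~2 seconds: at most 2 run at once.
print(results, f"{time.monotonic() - start:.1f}s")
```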
@technillogue You have more context about this change and async Python in general. Do you think we should reconsider #1355? Or is that a prerequisite for the `async` branch we're working to merge?
I think there's definitely a use case here for running (synchronous) predictions on multiple VMs and checking each one's readiness for additional work. The `Prefer` header is interesting, but requiring the use of webhooks isn't necessarily ideal.
Thanks for pointing out #1355; as I noted in https://github.com/replicate/cog/pull/1720, the change would alter the current behavior (the ability to "use the network socket's accept queue as a mechanism to queue up work", as @nickstenning mentions in #1355). Whichever behavior is chosen (or even if there's an option to make it a user choice), I think this could be better documented, as right now it's something you just have to discover by accident while using cog. :)
Hi, I'm trying to use cog to build images I can run in a Kubernetes cluster. I'm seeing an issue where I can't query the state of each pod (e.g. using the `/health-check` endpoint that cog defines) while a prediction is running. I have a simple demonstration here: https://github.com/ggilder/cog-health-check-test
You can run the above on Docker Desktop. It's basically the cog "hello world" example with a sleep added to simulate a long-running prediction. The test script hits the prediction endpoint and then attempts to hit the health check endpoint, but the server waits for the prediction to complete before handling the health check, which doesn't seem to match the intended behavior (since e.g. the health state is defined to allow a "busy" state).
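The gist of the test script is roughly the following (a sketch rather than the exact script in the repo; the port and payload are assumptions):

```python
# Rough sketch of the repro: start a prediction, then poll the health check
# while it runs. See the linked repo for the actual test script.
import threading
import time

import requests


def run_prediction():
    requests.post(
        "http://localhost:5000/predictions",
        json={"input": {"text": "world"}},  # the predictor sleeps to simulate long work
    )


t = threading.Thread(target=run_prediction)
t.start()

time.sleep(1)  # give the prediction a moment to start
start = time.monotonic()
health = requests.get("http://localhost:5000/health-check", timeout=120)
elapsed = time.monotonic() - start

# Expected: a quick response (perhaps reporting a BUSY status).
# Observed: this request doesn't return until the prediction finishes.
print(health.json(), f"health check answered after {elapsed:.1f}s")
t.join()
```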
Am I missing something that would allow the server to respond to health checks during predictions?