replicate / replicate-python

Python client for Replicate
https://replicate.com
Apache License 2.0
744 stars 212 forks

model container failed to boot and complete setup within 600 seconds #307

Closed christopher5106 closed 3 months ago

christopher5106 commented 3 months ago

Hi,

We are currently experiencing boot errors even though our code has not changed.

We don't have much information beyond the message "model container failed to boot and complete setup within 600 seconds".

What could be the issue? The built Docker image runs fine locally. The issue is intermittent.

erbridge commented 3 months ago

Please contact support for help with your specific model: https://replicate.com/support

christopher5106 commented 3 months ago

Support was contacted earlier this morning, and the issue is still not solved. And what about such an error message, with no further clues and no logs?

christopher5106 commented 3 months ago

More explicit error messages should be planned ASAP.

christopher5106 commented 3 months ago

Thanks for the update with the setup logs! That's very good news.

Your communication on GitHub should be improved; I only discovered it this morning.

mattt commented 3 months ago

Hi @christopher5106. Sorry we didn't do a great job helping you out here. It sounds like the setup logs on replicate.com gave you the information you needed to solve your problem. Was there anything else we can help you with?

christopher5106 commented 3 months ago

@mattt After discussing with my team, we are not sure the setup logs were available before, because we only paid attention to them once machines stopped booting at all:

model container failed to boot and complete setup within 600 seconds

As you can see, while your status page showed all green yesterday, we were unable to start any predictions for the entire day.

(two attached images)

We don't know what we should do when we see the message:

model container failed to boot and complete setup within 600 seconds

because it's not explicit enough for us to take action. Today, although the Docker image has not changed, the problem is gone, but boot times are still a bit slow.

We experienced the boot failures on multiple model endpoints, and also on deployment endpoints.

mattt commented 3 months ago

@christopher5106 Sorry, I'm having trouble finding a support ticket with more details about your case. Could you please share a link to the model that's failing to boot?

christopher5106 commented 3 months ago

I sent you a PM on X listing half a dozen versions we published, which had a few dozen boot failures yesterday.

mattt commented 3 months ago

@christopher5106 Thanks for your patience. I escalated internally, and our customer support engineer has responded to your ticket. I'll let them take it from there.

christopher5106 commented 3 months ago

Thanks!