Open · Varun2101 opened 10 months ago
Hi, @Varun2101. Thanks for sharing this feedback.
To your second point, you can get more control over the behavior of a model on Replicate by creating a deployment. I don't believe we provide a way to configure the timing for autoscaling a deployment, but that's something we've discussed.
I've built a workaround that simply pings the model (to minimize tokens, I ask it to respond with a single character). Each ping takes about 0.1 seconds of runtime, and I set the ping frequency to one minute. After 15 minutes of inactivity the pings stop. While it doesn't solve Replicate's initial cold-boot issue, this keeps the model warm while a user is active, at negligible cost - far cheaper than creating a dedicated deployment.
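A minimal sketch of that keep-warm loop (names here are hypothetical; in practice `ping` would send a one-character prompt through your Replicate client, and the injectable `clock`/`sleep` parameters exist only so the loop can be exercised without real waiting):

```python
import time

class KeepWarmPinger:
    """Keep a serverless model instance warm while a user is active.

    `ping` is any callable that sends a minimal request (e.g. a prompt
    asking the model to reply with a single character). Call
    `mark_activity()` whenever a real user request arrives; pinging
    stops once `idle_limit` seconds pass with no activity.
    """

    def __init__(self, ping, interval=60.0, idle_limit=900.0,
                 clock=time.monotonic):
        self.ping = ping
        self.interval = interval      # seconds between pings (1 minute)
        self.idle_limit = idle_limit  # give up after 15 idle minutes
        self.clock = clock
        self.last_activity = clock()

    def mark_activity(self):
        """Record that a real user request just came in."""
        self.last_activity = self.clock()

    def run(self, sleep=time.sleep):
        """Ping until the idle limit is exceeded, then return."""
        while self.clock() - self.last_activity < self.idle_limit:
            self.ping()
            sleep(self.interval)
```

You'd run this on a background thread and call `mark_activity()` from your request handler; once users go quiet for 15 minutes, the loop exits and the instance is allowed to scale to zero.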
Hi, I'm working on deploying a private language model to production through Replicate. Requests come in sporadically, so provisioning always-on servers is not feasible for me, but I would still like requests to be handled at my maximum concurrency for speed. Currently I face 2-2.5 minutes of cold start for each instance, and each instance terminates after 1 minute of idleness, which leads to frustrating delays that are longer than necessary. Would it be possible to add either of these functionalities?
Currently I'm attempting a workaround for no. 1 by burst-pinging the model with the default input n times ahead of time, but the short idle window means there's still a good chance the instances are terminated before I send any actual requests. Let me know if you have a better solution. Thanks!
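The burst-ping step can be sketched like this (a hedged sketch, assuming a hypothetical `ping` callable that wraps one default-input request; the pings must be concurrent, since sequential pings would all be served by the same already-warm instance):

```python
from concurrent.futures import ThreadPoolExecutor

def burst_warm(ping, n):
    """Fire n concurrent minimal requests so the autoscaler spins up
    (up to) n instances before real traffic arrives.

    `ping` is any callable that sends one default-input request and
    returns its result.
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(ping) for _ in range(n)]
        # Block until every ping completes, so each instance has
        # actually finished booting before real requests go out.
        return [f.result() for f in futures]
```

The remaining problem, as noted above, is that nothing here prevents the warmed instances from idling out before the real requests arrive - which is why a configurable idle timeout would help.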