Open AyushSawant18588 opened 1 month ago
When updating a service, the previous revision's pods terminate, but resources like GPUs take time to free up. As a result, the new revision's pods initially hit CUDA out-of-memory errors because the GPUs still hold the model weights from the previous revision. Consequently, the new pods restart several times and take a few minutes to reach a running state.

Is this expected behavior for this scenario?

This is normal. All resources (CPUs, memory, GPUs, etc.) remain occupied by the old revisions as long as those revisions still exist: even pods in Terminating status hold their resources until they are fully gone. The new revision has to wait until those resources are released, so seeing its pods retry and restart in the meantime is expected.
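One way to make the new revision tolerate this window (instead of crash-looping on CUDA OOM) is to retry the model-load step with exponential backoff until the old revision's GPU memory is released. Below is a minimal sketch; the `load_model` callable and the use of `RuntimeError` to stand in for a CUDA out-of-memory error are illustrative assumptions, not something from this thread:

```python
import time

def retry_with_backoff(load_model, max_attempts=5, base_delay=2.0):
    """Retry a model-load callable until GPU memory frees up.

    In a real service, load_model might raise a CUDA out-of-memory
    error while the old revision still holds the GPU; here any
    RuntimeError triggers a retry with exponential backoff.
    """
    for attempt in range(max_attempts):
        try:
            return load_model()
        except RuntimeError as exc:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            delay = base_delay * (2 ** attempt)
            print(f"load failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)

# Simulated scenario: the "GPU" only frees up after two failed attempts.
attempts = {"n": 0}

def fake_load():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("CUDA out of memory")
    return "model ready"

print(retry_with_backoff(fake_load, base_delay=0.01))
```

In a Kubernetes/Knative setting the same effect can be had by letting the container exit and relying on the pod restart backoff, but an in-process retry avoids full restarts and keeps the startup probe simpler.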