Open AyushSawant18588 opened 1 month ago
When updating a service, the previous revision's pods terminate, but resources like GPUs take time to free up. As a result, the new revision's pods initially hit CUDA out-of-memory errors because the GPUs still hold the model weights from the previous revision. Consequently, the new pods restart several times and take a few minutes to reach a running state.

Is this expected behavior for this scenario?

This is normal. All resources (CPUs, memory, GPUs, etc.) remain occupied by the old revisions as long as those revisions still exist: even pods in Terminating status hold their resources until they are fully gone. The new revision has to wait until those resources are released, so seeing its pods retry and restart in the meantime is expected.
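One way to make the new revision tolerate this window (instead of crash-looping on CUDA OOM) is to retry the model-load step with exponential backoff until the old revision's GPU memory is released. Below is a minimal sketch; the `load_model` callable and the use of `RuntimeError` to stand in for a CUDA out-of-memory error are illustrative assumptions, not something from this thread:

```python
import time

def retry_with_backoff(load_model, max_attempts=5, base_delay=2.0):
    """Retry a model-load callable until GPU memory frees up.

    In a real service, load_model might raise a CUDA out-of-memory
    error while the old revision still holds the GPU; here any
    RuntimeError triggers a retry with exponential backoff.
    """
    for attempt in range(max_attempts):
        try:
            return load_model()
        except RuntimeError as exc:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            delay = base_delay * (2 ** attempt)
            print(f"load failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)

# Simulated scenario: the "GPU" only frees up after two failed attempts.
attempts = {"n": 0}

def fake_load():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("CUDA out of memory")
    return "model ready"

print(retry_with_backoff(fake_load, base_delay=0.01))
```

In a Kubernetes/Knative setting the same effect can be had by letting the container exit and relying on the pod restart backoff, but an in-process retry avoids full restarts and keeps the startup probe simpler.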