kserve / modelmesh-serving


Model Loading Requests Contention #469

Open · GolanLevy commented 9 months ago

Describe the bug

Model loading requests are not balanced evenly across predictors. At any given moment the system can receive many requests for different (mostly unloaded) models. Instead of the loading requests being spread across all predictors, we see that one predictor can receive ~30 requests (out of ~50) while the other predictors are completely idle (in terms of both model loading and inference processing). This obviously creates temporary hotspots. The hotspots are not static, as the "popular" predictor changes over time, resulting in "waves" of model loading requests per predictor (see the image of 3 different predictors over time).

We suspect that every model loading request is routed to the same mm instance X because X sits at the top of the placement priority queue from the perspective of each of the mm instances. Since it takes a few seconds for the system to register that X is concurrently receiving many requests and should be considered "busy", X absorbs all the loading requests for a short window.

Is this hypothesis correct? If not, how can we debug this?

[Image: model loading requests per predictor over time for 3 different predictors, showing alternating "waves" of load]
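To make the hypothesis concrete, here is a toy simulation (not ModelMesh code; all instance counts and load numbers are made up) of what happens when a burst of placements is made against a load snapshot that is only refreshed every few seconds. It contrasts the deterministic "pick the top of the priority queue" rule with randomized power-of-two-choices, a common mitigation for exactly this kind of herding:

```python
# Toy simulation of placement herding under a stale load snapshot.
# All constants are illustrative; this is not ModelMesh's actual logic.
import random
from collections import Counter

INSTANCES = 30   # predictors
REQUESTS = 50    # loading requests arriving within one staleness window
SNAPSHOT = [random.randint(0, 20) for _ in range(INSTANCES)]  # stale load view

def place_least_loaded(snapshot):
    # Deterministic "top of the priority queue": with a shared stale
    # snapshot, every placement agrees on the same instance X.
    return min(range(len(snapshot)), key=lambda i: snapshot[i])

def place_power_of_two(snapshot):
    # Sample two random candidates and take the less loaded one;
    # randomness breaks the herd even when the snapshot is stale.
    a, b = random.sample(range(len(snapshot)), 2)
    return a if snapshot[a] <= snapshot[b] else b

for strategy in (place_least_loaded, place_power_of_two):
    hits = Counter(strategy(SNAPSHOT) for _ in range(REQUESTS))
    print(strategy.__name__, "-> max requests on one instance:", max(hits.values()))
```

With the deterministic rule, all 50 requests land on the single instance that looks least loaded in the stale snapshot, which matches the ~30-of-~50 skew we observe; the randomized variant spreads the same burst across many instances.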

Configuration

We have a high-workload environment with thousands of registered models, each requested every couple of minutes, resulting in a very high model-swap rate (many unload and load events in a short period of time). We have a few dozen predictors, each able to hold ~20 models. The ModelMesh containers are load balanced (round robin) over gRPC, with the mm-balanced header set to true. ModelMesh is configured to use rpm-based decisions (busyness, scaling, ...) rather than the experimental latency-based ones (is that worth trying?).
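For reference, this is roughly how our clients attach the headers on each call; this is a minimal sketch assuming a Python gRPC client, and the channel address, model id, and stub are placeholders for our actual setup, not anything prescribed by ModelMesh:

```python
# Sketch: setting the mm-vmodel-id and mm-balanced headers on a gRPC call.
# Endpoint, model id, and stub are hypothetical placeholders.
import grpc

channel = grpc.insecure_channel("modelmesh-serving.example:8033")
# stub = InferenceServiceStub(channel)  # generated from the inference proto

metadata = (
    ("mm-vmodel-id", "example-model"),  # which (v)model to serve
    ("mm-balanced", "true"),            # requests are already round-robined
)
# response = stub.Predict(request, metadata=metadata)
```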