kserve / modelmesh-serving

Controller for ModelMesh
Apache License 2.0

8 retry iterations exhausted #341

Open yc2984 opened 1 year ago

yc2984 commented 1 year ago

Is your feature request related to a problem? If so, please describe.

I encountered this issue while stress testing modelmesh. Exception invoking inference.GRPCInferenceService/ModelInfer method of resolved model enrxorg003-5514162725715968__isvc-8e77e8507b of vmodel enrxorg003-5514162725715968: InternalException: 8 retry iterations exhausted

We've implemented a retry mechanism for certain status codes: INFERENCE_SERVICE_SETTINGS_RETRYABLE_STATUS = ["UNAVAILABLE", "DEADLINE_EXCEEDED", "CANCELLED", "RESOURCE_EXHAUSTED"]

However, this one doesn't seem to have the proper status code set.

What is the correct way to handle this error? It typically happens under high load, so it should be retryable.
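
(For illustration, a retry keyed on these status codes would look roughly like the sketch below; the stub, request, and metadata objects are placeholders rather than our exact client code. The point is that the error above surfaces as INTERNAL, which is not in this retryable set.)

import time
import grpc

INFERENCE_SERVICE_SETTINGS_RETRYABLE_STATUS = {
    grpc.StatusCode.UNAVAILABLE,
    grpc.StatusCode.DEADLINE_EXCEEDED,
    grpc.StatusCode.CANCELLED,
    grpc.StatusCode.RESOURCE_EXHAUSTED,
}

def infer_with_retry(stub, request, metadata, max_attempts=5, backoff_s=2.0):
    """Call ModelInfer, retrying only on status codes considered transient."""
    for attempt in range(1, max_attempts + 1):
        try:
            return stub.ModelInfer(request, metadata=metadata)
        except grpc.RpcError as err:
            # The "8 retry iterations exhausted" failure comes back as INTERNAL,
            # which is not in the retryable set, so it is raised immediately.
            if err.code() not in INFERENCE_SERVICE_SETTINGS_RETRYABLE_STATUS or attempt == max_attempts:
                raise
            time.sleep(backoff_s * (2 ** (attempt - 1)))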

Describe your proposed solution

Set the proper GRPC status code so the user can retry it.

Describe alternatives you have considered

Perhaps a configurable number of retries on the mm container?

Additional context

tjohnson31415 commented 1 year ago

To provide some additional context: the error log shows that there have been 8 attempts at retrying the request within ModelMesh. Since models are allocated to instances dynamically and the instance that receives a request may not have the model loaded locally, ModelMesh has its own internal request-forwarding and retry system. There are a number of reasons those internal retries can happen, such as:

Over the course of attempting to process a single external request, some combination of these internal request failures can result in InternalException: 8 retry iterations exhausted. It looks like there is special handling for the case where the internal instances return RESOURCE_EXHAUSTED, forwarding that status to the client. I think there are other cases where consistent repeated errors result in a more precise error, but the log you show looks like a more unexpected failure that results in an INTERNAL gRPC response (network failures, for example).

I think this is just one of those cases where there is enough load going into the system that even the retries / load-shedding aren't enough to keep it from falling over. To handle this load, the system probably needs to be scaled up.

Also to note, the internal retry limit is not configurable.

yc2984 commented 1 year ago

@tjohnson31415 Hey thanks for the reply.

To handle this load, the system probably needs to be scaled up.

Do you have any suggestions? What configurations can I tweak? I don't see any resource (cpu/memory) related issues on any of the model serving containers: mlserver, mm, mlserver-adapter, restProxy (which we don't use) or modelmesh controller.

tjohnson31415 commented 1 year ago

Hmm, yeah hard to say if there aren't any apparent resource constraints 🤔. I think we'll need to find some more details as to what is causing the failures.

How many different models are in active use? Can they all fit into the memory available in the Runtime or is there a lot of churn in the models being used?

Are there any restarts for containers in any of the ServingRuntime pods?

If you look at the mm logs for the enrxorg003-5514162725715968 model id, are there logs indicating what is causing the internal failures/retries aside from the overall "retry iterations exhausted"?

yc2984 commented 1 year ago

@tjohnson31415

How many different models are in active use? Can they all fit into the memory available in the Runtime or is there a lot of churn in the models being used?

There are 2500 models, each ~170MB (425 GB in total), so they definitely don't fit into the available memory (22 GB).

Are there any restarts for containers in any of the ServingRuntime pods?

No

If you look at the mm logs for the enrxorg003-5514162725715968 model id, are there logs indicating what is causing the internal failures/retries aside from the overall "retry iterations exhausted"?

It might be that the estimated size of the model is too small? The logs show initial weight 114 units (~1MiB):

Evicted model enrxorg003-5514162725715968__isvc-8e77e8507b from local cache, last used a few seconds ago (1678274048783ms), invoked 0 times

Cache miss for model invocation, held up 2470ms

Model enrxorg003-5514162725715968__isvc-8e77e8507b removed while waiting for locally-triggered load

About to enqueue load for model enrxorg003-5514162725715968__isvc-8e77e8507b with initial weight 113 units (~1MiB), with priority 1678281251264

Eviction triggered for model enrxorg003-5514162725715968__isvc-8e77e8507b (weight=113 / 1MiB)

Evicted model enrxorg003-5514162725715968__isvc-8e77e8507b from local cache, last used a few seconds ago (1678274051264ms), invoked 0 times

Cache miss for model invocation, held up 2512ms

Model enrxorg003-5514162725715968__isvc-8e77e8507b removed while waiting for locally-triggered load

8 retry iterations exhausted for model enrxorg003-5514162725715968__isvc-8e77e8507b

8 retry iterations exhausted

applyModel throwing exception for model enrxorg003-5514162725715968__isvc-8e77e8507b

Remote invocation failed for model enrxorg003-5514162725715968__isvc-8e77e8507b

About to enqueue load for model enrxorg003-5514162725715968__isvc-8e77e8507b with initial weight 114 units (~1MiB), with priority 1678281253831

tjohnson31415 commented 1 year ago

There are 2500 models, each ~170MB (425 GB in total), so they definitely don't fit into the available memory (22 GB).

😮 That's awesome!

It might be that the estimated size of the model is too small?

The estimated size will be used to decide where to load the model (find an instance with space). If the size is being underestimated too much, I'd expect that OOMs would become an issue since ModelMesh would schedule more models than could actually fit in memory. Since there aren't OOMs, it may be ok (i.e. there's enough buffer to cover the underestimation).

The "eviction" logs for a model used "a few seconds ago" do indicate churn in the models and cacheing. Eviction happens when a model comes in with an inference request and memory space needs to be made for it, so the LRU based cacheing kicks out a model used less recently. If the model was last used a few seconds ago and will be used again soon, there will be a lot of churn in the cache.

Do you have metrics enabled for ModelMesh (see the monitoring docs)? I think that looking at modelmesh_instance_lru_age_seconds would be useful as this will tell when the oldest model in the LRU was last used for inference. If this is on the order of seconds, it means that models are getting evicted very frequently. In other words, it would indicate if there is enough memory space across the pods to hold in memory the active set of models. In our deployments with ModelMesh, we try to keep modelmesh_instance_lru_age_seconds on the order of days, not seconds.
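
Once metrics are up, a quick way to eyeball this metric is to query it directly (a sketch against the standard Prometheus HTTP API; the Prometheus URL is a placeholder for your setup):

import requests

PROM_URL = "http://prometheus.example:9090"   # placeholder for your Prometheus endpoint
resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": "modelmesh_instance_lru_age_seconds"},
    timeout=10,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    # value is [timestamp, "value"]: seconds since the oldest model in that instance's LRU was used
    print(series["metric"].get("instance", ""), series["value"][1])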

yc2984 commented 1 year ago

@tjohnson31415 No we don't have monitoring set up yet, will set it up soon, thanks for the suggestion.

We've set up 4 replicas of the serving runtime pod, each with the 4 containers. I attached the logs from just 1 of the 4 pods, hope it helps, although I see the 8 retry iterations exhausted error in the other pods as well.

Retry setting:

{
    "methodConfig": [
        {
            "name": [{"service": "inference.GRPCInferenceService"}],
            "retryPolicy": {
                "maxAttempts": 5,
                "initialBackoff": "2s",
                "maxBackoff": "30s",
                "backoffMultiplier": 3,
                "retryableStatusCodes": [
                    "UNAVAILABLE", "DEADLINE_EXCEEDED", "CANCELLED",
                    "RESOURCE_EXHAUSTED", "NOT_FOUND",
                ],
            },
        }
    ]
}
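
(For context, a methodConfig like this is supplied as a gRPC service config when the channel is created; a minimal sketch assuming the Python grpc client, with a placeholder target address:)

import json
import grpc

retry_service_config = {"methodConfig": [{
    "name": [{"service": "inference.GRPCInferenceService"}],
    "retryPolicy": {"maxAttempts": 5, "initialBackoff": "2s", "maxBackoff": "30s",
                    "backoffMultiplier": 3,
                    "retryableStatusCodes": ["UNAVAILABLE", "DEADLINE_EXCEEDED", "CANCELLED",
                                             "RESOURCE_EXHAUSTED", "NOT_FOUND"]},
}]}  # same methodConfig as shown above

channel = grpc.insecure_channel(
    "modelmesh-serving:8033",                                   # placeholder target address
    options=[
        ("grpc.service_config", json.dumps(retry_service_config)),
        ("grpc.enable_retries", 1),                             # ensure client-side retries are on
    ],
)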

If this is on the order of seconds, it means that models are getting evicted very frequently. In other words, it would indicate if there is enough memory space across the pods to hold in memory the active set of models.

Question 1: Why is the memory usage relatively low? The total memory limit is 22GB = (mlserver 2Gi + mm container 1Gi + mlserver-adapter 2Gi + rest proxy 512Mi) * 4 and the requested amount is ~10GB, but only around 3.5GB is actually used. I also monitored the individual containers' memory during testing and never saw memory issues, and this is the only kind of error I'm getting.

It might be that our load is very uniform, so the LRU cache doesn't really help that much. The typical load is simply calling each of the 2500 models once within a short period of time (minutes).

Question 2: In the logs, I see: about to enqueue load for model enrxorg003-5514162725715968__isvc-8e77e8507b with initial weight 114 units (~1MiB), with priority 1678281253831. What does this 1MiB mean? And later there is Memory over-allocation due to under-prediction of model size. Cache deficit increased by 21990 units to 21990 (172MiB). Can you give some context on this?

Two things we will try:

  1. Test the same load with an increased memory limit for mlserver; maybe it will help with the frequent evictions.
  2. Increase the number of replicas from 4 to 5 or 6.
[Screenshots attached: 2023-03-09 12:15:23 and 12:15:55]

tjohnson31415 commented 1 year ago

It might be that our load is very uniform, so the LRU cache doesn't really help that much. The typical load is simply calling each of the 2500 models once within a short period of time (minutes).

Ah, that is an interesting use-case! The typical ModelMesh use-case that I am familiar with has a large number of registered models, but only a subset being used in a given time-window (on the order of hours). Cycling through all the models will definitely work against the LRU nature of the cache and result in churn. I'll have to investigate if there are settings that can be tuned for this use-case. If it is possible to group requests temporally based on the models that they hit, that would help.
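
For example, something along these lines on the client side could do the grouping (just a sketch; send_fn stands in for the existing inference call and the request tuples are placeholders):

from itertools import groupby

def dispatch_grouped(requests, send_fn):
    """requests: iterable of (model_id, payload) pairs; send all requests for one
    model back-to-back so it is loaded once per sweep instead of evicted and reloaded."""
    ordered = sorted(requests, key=lambda r: r[0])          # group requests by model id
    for model_id, group in groupby(ordered, key=lambda r: r[0]):
        for _, payload in group:
            send_fn(model_id, payload)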

Question 1: Why is the memory usage relatively low? The total memory limit is 22GB = (mlserver 2Gi + mm container 1Gi + mlserver-adapter 2Gi + rest proxy 512Mi) * 4 and the requested amount is ~10GB, but only around 3.5GB is actually used. I also monitored the individual containers' memory during testing and never saw memory issues, and this is the only kind of error I'm getting.

For loaded models, the mlserver container is the relevant one to monitor for memory usage. Also, the container's allocated memory is only part of the story: ModelMesh does its own accounting of memory based on model size estimations. The ServingRuntime tells ModelMesh what its memory capacity for loaded models is. For the built-in MLServer runtime, the adapter computes this value based on the CONTAINER_MEM_REQ_BYTES and MEM_BUFFER_BYTES environment variables (REF). The amount of memory used by a model is also based on estimates/heuristics; the built-in adapter uses the size of the model files on disk for this estimation (REF).

Depending on the accuracy of the estimations, the real memory usage and the accounted memory usage may diverge such that modelmesh is artificially capping the real memory usage.
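
To make that concrete with the numbers from this thread (a rough back-of-the-envelope only; the subtraction formula and the buffer value here are assumptions, the exact calculation is in the adapter code referenced above):

GiB = 1024 ** 3
MiB = 1024 ** 2

container_mem_req_bytes = 2 * GiB    # mlserver memory per pod, from this thread
mem_buffer_bytes = 256 * MiB         # placeholder; check the adapter's MEM_BUFFER_BYTES default
model_size_bytes = 170 * MiB         # ~170MB per model, from this thread
replicas = 4

capacity_per_pod = container_mem_req_bytes - mem_buffer_bytes   # assumed capacity formula
models_per_pod = capacity_per_pod // model_size_bytes
print(models_per_pod, models_per_pod * replicas)                # ~10 per pod, ~40 across 4 pods

So only a few dozen of the 2500 models can be resident at once, which is consistent with the eviction churn in your logs.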

In the logs, I see: "about to enqueue load for model enrxorg003-5514162725715968__isvc-8e77e8507b with initial weight 114 units (~1MiB), with priority 1678281253831". What does this 1MiB mean? And later there is "Memory over-allocation due to under-prediction of model size. Cache deficit increased by 21990 units to 21990 (172MiB)". Can you give some context on this?

"Weight" in this context is just estimated memory usage (the cache code deals with "weights" but the value for the weights are set based on memory usage). Before a model is loaded, there is a predicted memory size that is used to decide which instance to load it in. After a model is loaded is when a more precise estimation can happen. The "initial weight" of 1MiB is coming from the pre-load model size estimation, which I think is just using a coded default model size. The heuristic based on disk size is used after the model is loaded.

I'm a bit fuzzy on the details of the cache management and "deficit". From this comment on cacheDeficit, I gather that the deficit occurs when the post-load estimation exceeds the pre-load estimation resulting in a total that exceeds the cache limit. At some point, the deficit is recovered by evicting models to get the memory back below the limit.
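
As a sanity check on the numbers in your log lines (rough arithmetic only; the unit size is inferred from the logs, the exact constant lives in the ModelMesh code):

MiB = 1024 ** 2

initial_weight_units = 114                      # pre-load estimate from the log (~1MiB)
deficit_units = 21990                           # deficit added after load, per the log
post_load_units = initial_weight_units + deficit_units

unit_bytes = 172 * MiB / post_load_units        # post-load estimate was reported as 172MiB
print(round(unit_bytes / 1024, 1))              # ~8.0 KiB per unit
print(round(initial_weight_units * unit_bytes / MiB, 2))   # ~0.89 MiB pre-load estimate

In other words, the post-load estimate (~172MiB) lines up with the ~170MB model size on disk; it is the ~1MiB pre-load default that is far too small, which is what triggers the deficit.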


Increasing the memory size of the runtime should help with this. To increase the capacity for loaded models in the mesh, you should increase resources.requests.memory for the mlserver container; that value is what the adapter uses to set CONTAINER_MEM_REQ_BYTES. Most of the parameters for model size estimation are not easily configurable (only memBufferBytes is, via the builtInAdapter config in the ServingRuntime spec), so increasing the memory per pod or increasing the number of pods are the easiest quick fixes.

MLHafizur commented 1 year ago

Hello @yc2984, Which runtime are you using?

yc2984 commented 1 year ago

@MLHafizur scikit-learn from MLServer

yc2984 commented 1 year ago

@tjohnson31415 thanks so much for the in-depth explanations. We will do some testing and get back to you!