marqo-ai / marqo

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
https://www.marqo.ai/
Apache License 2.0

[ENHANCEMENT] Automatically eject a model in cache #343

Open wanliAlex opened 1 year ago

wanliAlex commented 1 year ago

Is your feature request related to a problem? Please describe. We would like the Marqo instance to manage RAM and CUDA memory automatically. It should be able to eject a model when too many models are loaded.

Design docs will be provided shortly.

wanliAlex commented 1 year ago

Automatically eject a model in the cache

Intro

Currently, Marqo has no automatic way to manage the loaded models, which can cause memory issues on both cuda and cpu. Our manual eject-model API does not work properly in the cloud, so we need an automatic way to manage the models in Marqo.

All the models are stored in available_models with the following format:

available_models = {"model_cache_key_1": Model_1,
                    "model_cache_key_2": Model_2,}

Removing a model from memory is just del Model_1. Note that models can be stored both in RAM and on cuda.
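
A minimal sketch of what the ejection step could look like is below (the eject_model helper and its device argument are assumptions for illustration, not the real Marqo API). For cuda models, deleting the Python reference alone does not immediately return memory to the driver, so clearing the torch allocator cache is usually also needed:

import gc
import torch

def eject_model(model_cache_key, device):
    # drop the cache reference so the model object can be garbage collected
    del available_models[model_cache_key]
    gc.collect()
    if device.startswith("cuda"):
        # release the cached blocks held by the torch allocator back to the driver
        torch.cuda.empty_cache()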

Strategy

We need to decide which metric to use to determine whether a model should be ejected, and when to evaluate that metric (the checkpoint).

Metrics

There are two candidate metrics for deciding whether we need to eject a model:

1. Memory usage based

We focus on the memory usage of cuda and cpu and set a threshold for each. If the used memory is higher than the threshold, one or several models need to be deleted from the cache. An important part is how to measure the memory usage of cuda and cpu. For cuda, we can use torch.cuda.memory_allocated(device=None) to return the memory usage of each cuda device (if we have multiple GPUs). For cpu, the package I found is psutil, which can report the RAM usage percentage.
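
A rough sketch of this check is below; the threshold values are placeholders and memory_pressure is a hypothetical helper, not an existing Marqo function:

import psutil
import torch

CUDA_THRESHOLD = 0.8   # placeholder: fraction of total device memory
CPU_THRESHOLD = 80.0   # placeholder: percent of system RAM

def memory_pressure(device):
    if device.startswith("cuda"):
        used = torch.cuda.memory_allocated(device=device)
        total = torch.cuda.get_device_properties(device).total_memory
        return used / total > CUDA_THRESHOLD
    # psutil reports system-wide RAM usage as a percentage
    return psutil.virtual_memory().percent > CPU_THRESHOLD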

2. Number of models based

We can eject a model whenever the number of models in available_models reaches a threshold. We also need separate thresholds for models on cpu and cuda.
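
A sketch of the count-based check, assuming a hypothetical model_devices mapping (model_cache_key -> device string) kept alongside available_models; the limits are placeholders:

CUDA_MODEL_LIMIT = 2   # placeholder threshold for cuda models
CPU_MODEL_LIMIT = 5    # placeholder threshold for cpu models

def too_many_models(device, model_devices):
    # count only the models loaded on the device we are about to use
    count = sum(1 for d in model_devices.values() if d == device)
    limit = CUDA_MODEL_LIMIT if device.startswith("cuda") else CPU_MODEL_LIMIT
    return count >= limit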

Checkpoint

We need to decide when we do the metric check and eject a model. There are two main methods, plus a mixture of the two.

1. Time interval based

We do a metric check whenever a fixed amount of time passes.

2. Model loading based

We do a check whenever we want to load a new model into the cache.

3. Mixture of both

We can mix methods 1 and 2, as sketched below.
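
A minimal sketch of the mixed checkpoint, assuming a module-level timestamp and a hypothetical check_and_eject helper that applies whichever metric we pick:

import time

CHECK_INTERVAL_SECONDS = 600   # placeholder interval
_last_check = time.time()

def maybe_check(loading_new_model=False):
    # check when a new model is about to be loaded, or when the interval has elapsed
    global _last_check
    if loading_new_model or time.time() - _last_check > CHECK_INTERVAL_SECONDS:
        check_and_eject()   # hypothetical helper applying the chosen metric
        _last_check = time.time()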

Model ejection algorithm

Here we discuss which model to eject. Note that there are three different types of models in Marqo: 1. inference models (sbert, clip), 2. preprocessing models (marqo-yolo), 3. reranking models (owl/ViT-B/32).

We can use a separate list model_priority_list to store all the model_cache_keys alongside available_models. The time complexity of ejecting a model is $O(n)$, where $n$ is the length of the list. The code is:

# if we have
available_models = {
"vit-l/14" : "dummy",
"vit-b/32" : "dummy",
"vit-b/16" : "dummy",
}
# the model_priority_list will be
model_priority_list = ["vit-l/14", "vit-b/32", "vit-b/16"]

# if we want to re-load the model "vit-b/32", move it to the end of the list
if "vit-b/32" in available_models:
    model_priority_list.remove("vit-b/32")
    model_priority_list.append("vit-b/32")

# if we want to load model "vit-g/14"
if "vit-g/14" not in available_models:
    available_models["vit-g/14"] = "dummy"
    model_priority_list.append("vit-g/14")
# we may also need to check that the two structures stay in sync
assert set(available_models) == set(model_priority_list)

If we want to eject a model, we have:

# eject a model based on some conditions
if len(available_models) > threshold:
    model_to_eject = model_priority_list.pop(0)   # the least recently used model is at the front
    del available_models[model_to_eject]
wanliAlex commented 1 year ago

@tomhamer