wanliAlex opened 1 year ago
Currently, Marqo does not have an automatic way to manage the loaded models. This will cause memory issues for `cuda` and `cpu`. Our manual eject-model API is not working properly in the cloud. We need to have an automatic way to manage the models in Marqo.
All the models are stored in `available_models` with the following format:

```python
available_models = {
    "model_cache_key_1": Model_1,
    "model_cache_key_2": Model_2,
}
```
Removing a model from memory is just `del Model_1`. Note that models can be stored both in RAM and on `cuda`; a sketch of the ejection step is given below.
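A minimal sketch of the ejection step, assuming `available_models` maps cache keys to loaded model objects as above. Note that on `cuda`, `del` only drops the Python reference; `torch.cuda.empty_cache()` is also needed to release PyTorch's cached memory back to the device:

```python
import gc
import torch

def eject_model(model_cache_key: str, available_models: dict) -> None:
    """Remove a model from the cache and free its memory."""
    model = available_models.pop(model_cache_key, None)
    if model is None:
        return
    del model      # drop the last Python reference
    gc.collect()   # make sure the object is actually collected
    # For models on a cuda device, release PyTorch's cached blocks
    # so the freed memory is visible outside the allocator.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```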
We need to select which metric we should focus on to decide if a model should be ejected, and when we should evaluate this metric (the checkpoint). There are two metrics to decide whether we need to eject a model:
The first metric is memory usage: we focus on the memory usage of `cuda` and `cpu` and set a threshold for each. If the used memory is higher than the threshold, it means one or several models need to be deleted from the cache. An important part is how to measure the memory usage of `cuda` and `cpu`. For `cuda`, we can use `torch.cuda.memory_allocated(device=None)` to return the memory usage of different cuda devices (if we have multiple GPUs). As for `cpu`, the package I found is `psutil`, which can measure the RAM usage percentage. A sketch of both checks is given below.
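A minimal sketch of both measurements, assuming the threshold names and values (`CUDA_THRESHOLD_BYTES`, `CPU_THRESHOLD_PERCENT`) are illustrative and not settled configuration:

```python
import psutil
import torch

# Hypothetical thresholds, for illustration only.
CUDA_THRESHOLD_BYTES = 4 * 1024 ** 3   # 4 GiB of allocated tensors per device
CPU_THRESHOLD_PERCENT = 80.0           # percent of total system RAM

def cuda_over_threshold(device: int = 0) -> bool:
    """Check whether tensor memory allocated by torch on a cuda device exceeds the threshold."""
    return torch.cuda.memory_allocated(device=device) > CUDA_THRESHOLD_BYTES

def cpu_over_threshold() -> bool:
    """Check whether system-wide RAM usage exceeds the threshold."""
    return psutil.virtual_memory().percent > CPU_THRESHOLD_PERCENT
```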
The second metric is the number of models: we can eject a model whenever the number of models in `available_models` reaches a threshold. We also need separate thresholds for models in `cpu` and `cuda` (see the sketch below).
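A sketch of the count-based check with separate per-device limits; the limit values and the `device_of` mapping (cache key to device) are hypothetical names introduced for illustration:

```python
# Hypothetical per-device limits, for illustration only.
MAX_MODELS = {"cpu": 4, "cuda": 2}

def too_many_models(available_models: dict, device_of: dict) -> bool:
    """Return True if any device holds more models than its limit."""
    counts: dict = {}
    for key in available_models:
        device = device_of[key]  # e.g. "cpu" or "cuda"
        counts[device] = counts.get(device, 0) + 1
    return any(counts.get(d, 0) > limit for d, limit in MAX_MODELS.items())
```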
We need to decide when we do the metric check and eject a model. There are two main methods:

1. We do a metric check whenever a fixed amount of time passes.
2. We do a check whenever we want to load a new model into the cache.

We can also mix method 1 and method 2. A sketch of method 2 is given after this list.
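A sketch of method 2 (checking at load time), reusing the hypothetical helpers from the earlier sketches (`cuda_over_threshold`, `cpu_over_threshold`, `eject_model`); `_load_from_disk` is also a hypothetical loader, not an existing Marqo function:

```python
def load_model(model_cache_key: str, device: str,
               available_models: dict, model_priority_list: list):
    """Return a cached model, ejecting least recently used models first if needed."""
    if model_cache_key in available_models:
        return available_models[model_cache_key]
    # Checkpoint: evaluate the metric before admitting a new model.
    while model_priority_list and (
        (device.startswith("cuda") and cuda_over_threshold())
        or (device == "cpu" and cpu_over_threshold())
    ):
        oldest = model_priority_list.pop(0)
        eject_model(oldest, available_models)
    model = _load_from_disk(model_cache_key, device)  # hypothetical loader
    available_models[model_cache_key] = model
    model_priority_list.append(model_cache_key)
    return model
```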
Here we discuss which model to eject. Note that there are 3 different types of models in Marqo: 1. inference models (sbert, clip); 2. preprocessing models (marqo-yolo); 3. reranking models (owl/ViT-B/32).
We can use a separate list `model_priority_list` to store all the `model_cache_keys` alongside `available_models`. The time complexity of ejecting a model is $O(n)$, where $n$ is the length of the list. The code is:
```python
# if we have
available_models = {
    "vit-l/14": "dummy",
    "vit-b/32": "dummy",
    "vit-b/16": "dummy",
}

# the model_priority_list will be (least recently used first)
model_priority_list = ["vit-l/14", "vit-b/32", "vit-b/16"]

# if we want to re-load the model "vit-b/32",
# move it to the end of the list (most recently used)
if "vit-b/32" in available_models:
    model_priority_list.remove("vit-b/32")
    model_priority_list.append("vit-b/32")

# if we want to load model "vit-g/14"
if "vit-g/14" not in available_models:
    available_models["vit-g/14"] = "dummy"
    model_priority_list.append("vit-g/14")

# we may also need to check that
# set(available_models) == set(model_priority_list)
assert set(available_models) == set(model_priority_list)
```
If we want to eject a model, we have:

```python
# eject a model based on some conditions
if len(available_models) > threshold:
    # pop the least recently used key so the list stays in sync
    model_to_eject = model_priority_list.pop(0)
    del available_models[model_to_eject]
```
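As a design note, Python's `collections.OrderedDict` gives the same least-recently-used behavior without a separate list, with $O(1)$ updates instead of the $O(n)$ `remove`; a minimal sketch (the values would be real model objects):

```python
from collections import OrderedDict

available_models: OrderedDict = OrderedDict()

def touch(model_cache_key: str) -> None:
    """Mark a cached model as most recently used."""
    available_models.move_to_end(model_cache_key)

def evict_oldest() -> str:
    """Remove and return the least recently used model's cache key."""
    model_cache_key, _model = available_models.popitem(last=False)
    return model_cache_key
```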
@tomhamer

Is your feature request related to a problem? Please describe.
We would like the Marqo instance to be able to manage RAM and CUDA memory automatically. It should be able to eject a model when too many models are loaded.

Design docs will be provided shortly.