marqo-ai / marqo

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
https://www.marqo.ai/
Apache License 2.0

[ENHANCEMENT] Model Cache Management #204

Closed pandu-k closed 1 year ago

pandu-k commented 1 year ago

Use cases to implement:

Edge cases:

pandu-k commented 1 year ago

To do: Prepare a design doc on this

wanliAlex commented 1 year ago

Model Cache API

Overview

Neural network models are used in the server to vectorise inputs into tensors. Models that are called by the vectorise function are loaded onto devices (cuda or cpu).

Currently, we do not have a method to manage the loaded models. Specifically, end users do not have a way to:

1) know which models are loaded,

2) eject a loaded model.

This design proposes adding these features.

Proposed Design

We split the design into two parts: client and server.

Client

For the client, the usage would look like this:

from marqo import Client

mq = Client()

# load a model onto a specific device
mq.load_model(model_name="ViT-L/14", device="cpu")

# eject a model from a specific device
mq.eject_model(model_name="ViT-L/14", device="cuda")

# get information about the loaded models on a specific device
# if no device is given, we return the models loaded on both "cuda" and "cpu"
mq.loaded_models(device="cuda")

The expected result of mq.loaded_models() should be a dictionary:

{
    "cuda": {
        # model_name: model_info, an example:
        "ViT-L/14": "a multi-modal model from OpenAI that can encode both text and image"
    },

    "cpu": {
        # model_name: model_info, an example:
        "onnx/ViT-L/14": "an ONNX version of ViT-L/14 that has faster inference speed"
    }
}

As for mq.load_model and mq.eject_model, the given model will be loaded onto or ejected from the given device, and mq.loaded_models will be called afterwards.

When mq.load_model is called, there is a possibility that the model cannot be loaded onto the specified device due to an out-of-memory issue. We have two ways to solve this problem:

  1. Raise an error and ask the end user to eject a model before loading this one.
  2. Automatically eject a model based on some algorithm (a queue might be used here so models follow a first-load-first-eject principle; see the sketch below).
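
If we go with option 2, the first-load-first-eject behaviour could be backed by one queue per device. Below is a minimal sketch under that assumption; the ModelCache class, its size limit, and the placeholder loading are purely illustrative and not existing Marqo code.

from collections import deque

class ModelCache:
    """Hypothetical per-device cache with first-load-first-eject eviction (option 2)."""

    def __init__(self, max_models_per_device=2):
        self.max_models = max_models_per_device
        # one FIFO queue per device, holding model names in load order
        self.queues = {"cpu": deque(), "cuda": deque()}
        # (model_name, device) -> loaded model object
        self.models = {}

    def load(self, model_name, device):
        if (model_name, device) in self.models:
            return self.models[(model_name, device)]
        # evict the oldest model on this device if the queue is full
        if len(self.queues[device]) >= self.max_models:
            oldest = self.queues[device].popleft()
            del self.models[(oldest, device)]  # placeholder for the real unloading step
        self.models[(model_name, device)] = f"<{model_name} loaded on {device}>"  # placeholder load
        self.queues[device].append(model_name)
        return self.models[(model_name, device)]

    def eject(self, model_name, device):
        if (model_name, device) not in self.models:
            raise KeyError(f"{model_name} is not loaded on {device}")
        del self.models[(model_name, device)]
        self.queues[device].remove(model_name)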

Marqo

All the cached models are stored in the dictionary variable available_models in the module marqo.s2_inference.s2_inference.

We may need to rewrite this variable so it meets the format described above in mq.loaded_models().

A queue (or two queues, one for "cuda" and one for "cpu") is needed to implement mq.eject_model() if we plan to use option 2.

pandu-k commented 1 year ago

@wanliAlex Thanks for the doc.

Feedback:

Next steps

wanliAlex commented 1 year ago

Overview

Neural network models are used in the server to vectorise inputs into tensors. Models that are called by the vectorise function are loaded onto devices (cuda or cpu).

Currently, we do not have a method to manage the loaded models. Specifically, end users do not have a way to know which models are loaded or to eject a loaded model.

This design proposes adding these features.

Proposed Design

We split the design into two parts: client and server.

Client

For the client, the usage would look like this:

from marqo import Client

mq = Client()

# eject a model from a specific device
mq.eject_model(model_name="ViT-L/14", device="cuda")

# get information about the loaded models on a specific device
# if no device is given, we return the models loaded on both "cuda" and "cpu"
mq.get_loaded_models(device="cuda")

The expected result of mq.get_loaded_models() should be a dictionary:

{
    "cuda": {
        # model_name: model_info, an example:
        "ViT-L/14": "a multi-modal model from OpenAI that can encode both text and image"
    },

    "cpu": {
        # model_name: model_info, an example:
        "onnx/ViT-L/14": "an ONNX version of ViT-L/14 that has faster inference speed"
    }
}

As for mq.eject_model, the given model will be ejected from the given device, and mq.get_loaded_models will be called afterwards.

When mq.add_document() is called, there is a possibility that the model cannot be loaded onto the specified device due to an out-of-memory issue.
Note: we are not going to solve the OOM error in this issue!

We have two ways to solve this problem:

  1. Raise an error and ask the end user to eject a model before loading this one. (current choice)
  2. Automatically eject a model based on some algorithm (a queue might be used here so models follow a first-load-first-eject principle). (left for future work)

Marqo

All the cached models are stored in the dictionary variable available_models in the module marqo.s2_inference.s2_inference.

When mq.get_loaded_models() is called, we need to return this dictionary, reformatted to match the expected output in the client.

Returned Information

The detailed model information is originally stored in the module marqo.s2_inference.model_registry, and it can be accessed through the variable MODEL_PROPERTIES in marqo.s2_inference.s2_inference.

The information of each model is stored as a dictionary entry like this:

'open_clip/RN50/openai':
           {'name': 'open_clip/RN50/openai',
            'dimensions': 1024,
            'note': 'clip model from open_clip implementation',
            'type': 'open_clip',
            'pretrained': 'openai'},

In addition to this, we may also need to provide the device ("cpu" or "cuda") on which the model is cached. We can read this information from each model's model.device attribute.
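
A rough sketch of how the server side could assemble the get_loaded_models() response from these pieces is shown below. The key format of available_models and the helper name format_loaded_models are assumptions made for illustration only; the real structure lives in marqo.s2_inference.s2_inference and may differ.

# Hypothetical helper, assuming available_models is keyed by (model_name, device) pairs.
# In practice the device could also be read from each model's model.device attribute.
def format_loaded_models(available_models: dict, model_properties: dict) -> dict:
    result = {"cuda": {}, "cpu": {}}
    for (model_name, device), model in available_models.items():
        info = model_properties.get(model_name, {})
        result[device][model_name] = info.get("note", "no description available")
    return result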

Updated/added function

As described, all the information is already stored in Marqo. I don't think we need to update or add any functions in tensor_search or s2_inference to access this information.

I need to go through the HTTP request parsing part to check the required functions there.

pandu-k commented 1 year ago

Thanks for the update @wanliAlex !

A detail you may need to be aware of:

wanliAlex commented 1 year ago

Features Added

In this PR, we add 3 new APIs:

  1. get_loaded_models(): get the loaded models on both "cpu" and "cuda"
  2. eject_model(): eject a model from a specific device
  3. get_cuda_info(): get the memory used on device="cuda"

The endpoint results are shown below:

# After we start Marqo, we have the pre-loaded models as:
curl -XGET http://localhost:8882/models
{"results":[{"model_name":"hf/all_datasets_v4_MiniLM-L6","device":"cpu"},{"model_name":"hf/all_datasets_v4_MiniLM-L6","device":"cuda"},{"model_name":"ViT-L/14","device":"cpu"},{"model_name":"ViT-L/14","device":"cuda"}]}

# We can check the cuda usage as:
curl -XGET http://localhost:8882/models/cuda
{"results":"You are using 1.7|14.6GiB on device=cuda"}

# We can eject a model from "cuda" by:
curl -X DELETE 'http://localhost:8882/models?model_name=ViT-L/14&model_device=cuda'
{"message":"eject model success, eject model_name = ViT-L/14 from device = cuda "}

# We can check the loaded models and cuda usage again:
curl -XGET http://localhost:8882/models/cuda
{"results":"You are using 0.1|14.6GB on device=cuda"}

curl -XGET http://localhost:8882/models
{"results":[{"model_name":"hf/all_datasets_v4_MiniLM-L6","device":"cpu"},{"model_name":"hf/all_datasets_v4_MiniLM-L6","device":"cuda"},{"model_name":"ViT-L/14","device":"cpu"}]}

# We can also eject a model from CPU by:
curl -X DELETE 'http://localhost:8882/models?model_name=hf/all_datasets_v4_MiniLM-L6&model_device=cpu'
{"message":"eject model success, eject model_name = hf/all_datasets_v4_MiniLM-L6 from device = cpu "}

curl -XGET http://localhost:8882/models
{"results":[{"model_name":"hf/all_datasets_v4_MiniLM-L6","device":"cuda"},{"model_name":"ViT-L/14","device":"cpu"}]}

# If we try to eject a model that is not loaded on the given device, we will get a 404 error with the returned message:
curl -X DELETE 'http://localhost:8882/models?model_name=ViT-L/14&model_device=cuda'
{"message":"The model_name=ViT-L/14 device=cuda is not loaded","code":"index_not_found","type":"invalid_request","link":null}

INFO:     127.0.0.1:42288 - "DELETE /models?model_name=ViT-L/14&model_device=cuda HTTP/1.1" 404 Not Found
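
For completeness, the same calls can be made from Python with the requests library; this is just an equivalent of the curl commands above against the default local endpoint.

import requests

BASE = "http://localhost:8882"

# list the loaded models on all devices
print(requests.get(f"{BASE}/models").json())

# check cuda memory usage
print(requests.get(f"{BASE}/models/cuda").json())

# eject a model from a specific device
resp = requests.delete(
    f"{BASE}/models",
    params={"model_name": "ViT-L/14", "model_device": "cuda"},
)
print(resp.status_code, resp.json())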

Can I get your opinions on this design? @pandu-k @Jeadie @jn2clark @tomhamer

pandu-k commented 1 year ago

Thanks @wanliAlex! One piece of feedback: for the eject endpoint, can you give the result of the operation, e.g. result: successful? Also, to be consistent with other API calls, a message like this would be better:

successfully ejected model `hf/all_datasets_v4_MiniLM-L6` from device `cpu`
pandu-k commented 1 year ago

Another piece of feedback: The DELETE endpoint is giving code index_not_found if the model isn't found. Perhaps create a new error (based on the IndexNotFound error) like ModelNotFound
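
For illustration, a new error of that kind could look like the sketch below. The base class shown here is a stand-in; the actual base error class, its fields, and where it lives in Marqo's errors module are assumptions, not the eventual implementation.

class MarqoWebError(Exception):  # stand-in for Marqo's real base web error
    status_code = 500
    code = "internal_error"
    error_type = "invalid_request"

    def __init__(self, message):
        self.message = message
        super().__init__(message)


class ModelNotFoundError(MarqoWebError):
    """Raised when an eject request names a model that is not loaded on the given device."""
    status_code = 404
    code = "model_not_found"
    error_type = "invalid_request"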

pandu-k commented 1 year ago

Also, for curl -XGET http://localhost:8882/models can we rename results to models:

{"models":[{"model_name":"hf/all_datasets_v4_MiniLM-L6", ...
pandu-k commented 1 year ago

Device may be better off with its own endpoint. Also, make the fields as programmatically usable as possible:

curl -XGET http://localhost:8882/device/cuda
{"memory_usage":"1.7 GiB", "total_device_memory": "14.6 GiB"}
wanliAlex commented 1 year ago

Another piece of feedback: The DELETE endpoint is giving code index_not_found if the model isn't found. Perhaps create a new error (based on the IndexNotFound error) like ModelNotFound

Thanks for the feedback. I may also create an s2_inference error named ModelNotLoaded.

pandu-k commented 1 year ago

Implemented in https://github.com/marqo-ai/marqo/pull/239