marqo-ai / marqo

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
https://www.marqo.ai/
Apache License 2.0

[ENHANCEMENT] Model Cache Management #204

Closed pandu-k closed 1 year ago

pandu-k commented 1 year ago

Use cases to implement:

Edge cases:

pandu-k commented 1 year ago

To do: Prepare a design doc on this

wanliAlex commented 1 year ago

Model Cache API

Overview

Neural network models are used in the server to vectorise inputs into tensors. Models that are called by the vectorise function are loaded onto devices (cuda or cpu).

Currently, we do not have a method to manage the loaded models. Specifically, end users do not have a way to:

1) know which models are loaded,

2) eject a loaded model.

This design proposes adding these features.

Proposed Design

We split the design into two parts: client and server.

Client

For the client, the usage would look like this:

from marqo import Client

mq = Client()

# load a model onto a specific device
mq.load_model(model_name="ViT-L/14", device="cpu")

# eject a model from a specific device
mq.eject_model(model_name="ViT-L/14", device="cuda")

# get information about the loaded models on a specific device
# if no device is given, we return the models loaded on both "cuda" and "cpu"
mq.loaded_models(device="cuda")

The expected result of mq.loaded_models() should be a dictionary:

{
    "cuda": {
        # model_name: model_info, an example:
        "ViT-L/14": "a multi-modal model from OpenAI that can encode both text and image"
    },

    "cpu": {
        # model_name: model_info, an example:
        "onnx/ViT-L/14": "an ONNX version of ViT-L/14 that has faster inference speed"
    }
}

As for mq.load_model and mq.eject_model, the given model will be loaded onto or ejected from the given device, and mq.loaded_models will be called afterwards.

When mq.load_model is called, there is a possibility that the model cannot be loaded onto the specified device due to an out-of-memory issue. We have two ways to solve this problem:

  1. Raise an error and ask the end user to eject a model before loading this one.
  2. Automatically eject a model based on some algorithm (a queue might be used here so models follow a first-load-first-eject principle; see the sketch below).
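
If we go with option 2, the first-load-first-eject behaviour could be backed by one queue per device. Below is a minimal sketch under that assumption; the ModelCache class, its size limit, and the placeholder loading are purely illustrative and not existing Marqo code.

from collections import deque

class ModelCache:
    """Hypothetical per-device cache with first-load-first-eject eviction (option 2)."""

    def __init__(self, max_models_per_device=2):
        self.max_models = max_models_per_device
        # one FIFO queue per device, holding model names in load order
        self.queues = {"cpu": deque(), "cuda": deque()}
        # (model_name, device) -> loaded model object
        self.models = {}

    def load(self, model_name, device):
        if (model_name, device) in self.models:
            return self.models[(model_name, device)]
        # evict the oldest model on this device if the queue is full
        if len(self.queues[device]) >= self.max_models:
            oldest = self.queues[device].popleft()
            del self.models[(oldest, device)]  # placeholder for the real unloading step
        self.models[(model_name, device)] = f"<{model_name} loaded on {device}>"  # placeholder load
        self.queues[device].append(model_name)
        return self.models[(model_name, device)]

    def eject(self, model_name, device):
        if (model_name, device) not in self.models:
            raise KeyError(f"{model_name} is not loaded on {device}")
        del self.models[(model_name, device)]
        self.queues[device].remove(model_name)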

Marqo

All the cached models are stored in the dictionary variable available_models in the module marqo.s2_inference.s2_inference.

We may need to rewrite this variable so it meets the format described above in mq.loaded_models().

A queue (or two queues, one for "cuda" and one for "cpu") is needed to implement mq.eject_model() if we plan to use option 2.

pandu-k commented 1 year ago

@wanliAlex Thanks for the doc.

Feedback:

Next steps

wanliAlex commented 1 year ago

Overview

Neural network models are used in the server to vectorise inputs into tensors. Models that are called by the vectorise function are loaded onto devices (cuda or cpu).

Currently, we do not have a method to manage the loaded models. Specifically, end users do not have a way to know which models are loaded or to eject a loaded model.

This design proposes adding these features.

Proposed Design

We split the design into two parts: client and server.

Client

For the client, the usage would look like this:

from marqo import Client

mq = Client()

# eject a model from a specific device
mq.eject_model(model_name="ViT-L/14", device="cuda")

# get information about the loaded models on a specific device
# if no device is given, we return the models loaded on both "cuda" and "cpu"
mq.get_loaded_models(device="cuda")

The expected result of mq.get_loaded_models() should be a dictionary:

{
    "cuda": {
        # model_name: model_info, an example:
        "ViT-L/14": "a multi-modal model from OpenAI that can encode both text and image"
    },

    "cpu": {
        # model_name: model_info, an example:
        "onnx/ViT-L/14": "an ONNX version of ViT-L/14 that has faster inference speed"
    }
}

As for mq.eject_model, the given model will be ejected from the given device, and mq.get_loaded_models will be called afterwards.

When mq.add_document() is called, there is a possibility that the model cannot be loaded onto the specified device due to an out-of-memory issue.
Note: we are not going to solve the OOM error in this issue!

We have two ways to solve this problem:

  1. Raise an error and ask the end user to eject a model before loading this one. (current choice)
  2. Automatically eject a model based on some algorithm (a queue might be used here so models follow a first-load-first-eject principle). (left for future work)

Marqo

All the cached models are stored in the dictionary variable available_models in the module marqo.s2_inference.s2_inference.

When mq.get_loaded_models() is called, we need to return this dictionary, reformatted to match the expected output in the client.

Returned Information

The detailed model information is originally stored in the module marqo.s2_inference.model_registry, and it can be accessed through the variable MODEL_PROPERTIES in marqo.s2_inference.s2_inference.

The information of each model is stored as a dictionary entry like this:

'open_clip/RN50/openai':
           {'name': 'open_clip/RN50/openai',
            'dimensions': 1024,
            'note': 'clip model from open_clip implementation',
            'type': 'open_clip',
            'pretrained': 'openai'},

In addition to this, we may also need to provide the device ("cpu" or "cuda") on which the model is cached. We can read this information from each model's model.device attribute.
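
A rough sketch of how the server side could assemble the get_loaded_models() response from these pieces is shown below. The key format of available_models and the helper name format_loaded_models are assumptions made for illustration only; the real structure lives in marqo.s2_inference.s2_inference and may differ.

# Hypothetical helper, assuming available_models is keyed by (model_name, device) pairs.
# In practice the device could also be read from each model's model.device attribute.
def format_loaded_models(available_models: dict, model_properties: dict) -> dict:
    result = {"cuda": {}, "cpu": {}}
    for (model_name, device), model in available_models.items():
        info = model_properties.get(model_name, {})
        result[device][model_name] = info.get("note", "no description available")
    return result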

Updated/added function

As described, all the information is already stored in Marqo. I don't think we need to update or add any functions in tensor_search or s2_inference to access this information.

I need to go through the HTTP request parsing part to check the required functions there.

pandu-k commented 1 year ago

Thanks for the update @wanliAlex !

A detail you may need to be aware of:

wanliAlex commented 1 year ago

Features Added

In this PR, we add 3 new APIs:

  1. get_loaded_models(): get the loaded models on both "cpu" and "cuda"
  2. eject_model(): eject a model from a specific device
  3. get_cuda_info(): get the memory used on device="cuda"

The endpoint results are shown below:

# After we start Marqo, we have the pre-loaded models as:
curl -XGET http://localhost:8882/models
{"results":[{"model_name":"hf/all_datasets_v4_MiniLM-L6","device":"cpu"},{"model_name":"hf/all_datasets_v4_MiniLM-L6","device":"cuda"},{"model_name":"ViT-L/14","device":"cpu"},{"model_name":"ViT-L/14","device":"cuda"}]}

# We can check the cuda usage as:
curl -XGET http://localhost:8882/models/cuda
{"results":"You are using 1.7|14.6GiB on device=cuda"}

# We can eject a model from "cuda" by:
curl -X DELETE 'http://localhost:8882/models?model_name=ViT-L/14&model_device=cuda'
{"message":"eject model success, eject model_name = ViT-L/14 from device = cuda "}

# We can check the loaded models and cuda usage again:
curl -XGET http://localhost:8882/models/cuda
{"results":"You are using 0.1|14.6GB on device=cuda"}

curl -XGET http://localhost:8882/models
{"results":[{"model_name":"hf/all_datasets_v4_MiniLM-L6","device":"cpu"},{"model_name":"hf/all_datasets_v4_MiniLM-L6","device":"cuda"},{"model_name":"ViT-L/14","device":"cpu"}]}

# We can also eject a model from CPU by:
curl -X DELETE 'http://localhost:8882/models?model_name=hf/all_datasets_v4_MiniLM-L6&model_device=cpu'
{"message":"eject model success, eject model_name = hf/all_datasets_v4_MiniLM-L6 from device = cpu "}

curl -XGET http://localhost:8882/models
{"results":[{"model_name":"hf/all_datasets_v4_MiniLM-L6","device":"cuda"},{"model_name":"ViT-L/14","device":"cpu"}]}

# If we try to eject a model that is not loaded on the given device, we will get a 404 error with the returned message:
curl -X DELETE 'http://localhost:8882/models?model_name=ViT-L/14&model_device=cuda'
{"message":"The model_name=ViT-L/14 device=cuda is not loaded","code":"index_not_found","type":"invalid_request","link":null}

INFO:     127.0.0.1:42288 - "DELETE /models?model_name=ViT-L/14&model_device=cuda HTTP/1.1" 404 Not Found
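
For completeness, the same calls can be made from Python with the requests library; this is just an equivalent of the curl commands above against the default local endpoint.

import requests

BASE = "http://localhost:8882"

# list the loaded models on all devices
print(requests.get(f"{BASE}/models").json())

# check cuda memory usage
print(requests.get(f"{BASE}/models/cuda").json())

# eject a model from a specific device
resp = requests.delete(
    f"{BASE}/models",
    params={"model_name": "ViT-L/14", "model_device": "cuda"},
)
print(resp.status_code, resp.json())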

Can I get your opinions on this design? @pandu-k @Jeadie @jn2clark @tomhamer

pandu-k commented 1 year ago

Thanks @wanliAlex! One piece of feedback: for the eject endpoint, can you give the result of the operation, e.g. result: successful? Also, to be consistent with other API calls, a message like this would be better:

successfully ejected model `hf/all_datasets_v4_MiniLM-L6` from device `cpu`
pandu-k commented 1 year ago

Another piece of feedback: The DELETE endpoint is giving code index_not_found if the model isn't found. Perhaps create a new error (based on the IndexNotFound error) like ModelNotFound
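
For illustration, a new error of that kind could look like the sketch below. The base class shown here is a stand-in; the actual base error class, its fields, and where it lives in Marqo's errors module are assumptions, not the eventual implementation.

class MarqoWebError(Exception):  # stand-in for Marqo's real base web error
    status_code = 500
    code = "internal_error"
    error_type = "invalid_request"

    def __init__(self, message):
        self.message = message
        super().__init__(message)


class ModelNotFoundError(MarqoWebError):
    """Raised when an eject request names a model that is not loaded on the given device."""
    status_code = 404
    code = "model_not_found"
    error_type = "invalid_request"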

pandu-k commented 1 year ago

Also, for curl -XGET http://localhost:8882/models can we rename results to models:

{"models":[{"model_name":"hf/all_datasets_v4_MiniLM-L6", ...
pandu-k commented 1 year ago

Device may be better off with its own endpoint. Also, make the fields as programmatically usable as possible:

curl -XGET http://localhost:8882/device/cuda
{"memory_usage":"1.7 GiB", "total_device_memory": "14.6 GiB"}
wanliAlex commented 1 year ago

Another piece of feedback: The DELETE endpoint is giving code index_not_found if the model isn't found. Perhaps create a new error (based on the IndexNotFound error) like ModelNotFound

Thanks for the feedback. I may also create an s2_inference error named ModelNotLoaded.

pandu-k commented 1 year ago

Implemented in https://github.com/marqo-ai/marqo/pull/239