To do: Prepare a design doc on this
Neural network models are used in the Server to vectorise the inputs into tensors. Models that are called by the vectorise function are loaded onto devices (cuda or cpu).
Currently, we do not have a method to manage the loaded models. Specifically, end users do not have a way to
1) know which models are loaded,
2) eject a loaded model.
This design proposes to add these features. We split the design into two parts: client and server.
For the client, the API would look like this:
from marqo import Client

mq = Client()
# load a model onto a specific device
mq.load_model(model_name="ViT-L/14", device="cpu")
# eject a model from a specific device
mq.eject_model(model_name="ViT-L/14", device="cuda")
# get information about the loaded models on a specific device
# if the device is not given, we return the models loaded on "cuda" and "cpu"
mq.loaded_models(device="cuda")
The expected result of mq.loaded_models() should be a dictionary:
{"cuda":
    {model_name: model_info,
     # an example
     "ViT-L/14": "a multi-modal model from OpenAI that can encode both text and images"
    },
 "cpu":
    {model_name: model_info,
     # an example
     "onnx/ViT-L/14": "an onnx version of ViT-L/14 that has faster inference speed"
    },
}
As for mq.load_model and mq.eject_model, the given model will be loaded onto or ejected from the given device, and mq.loaded_models will be called afterwards to confirm the result.
When mq.load_model is called, there is a possibility that the model cannot be loaded onto the specified device due to an out-of-memory issue. We have two ways to handle this problem:
1) throw an error and ask the end user to eject a model before loading this one;
2) automatically eject a loaded model (e.g. the least recently used one) to free memory.
All the cached models are stored in the dictionary variable available_models in the module marqo.s2_inference.s2_inference. We may need to rewrite this variable so it meets the format described above for mq.loaded_models().
A queue (or two queues, one for "cuda" and one for "cpu") is needed to implement the automatic ejection if we plan to use option 2.
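If we go with option 2, a minimal sketch of the per-device queue could look like the following. None of these names exist in Marqo; they only illustrate the bookkeeping the design implies:

from collections import OrderedDict

# Illustrative sketch only: an LRU-style queue per device ("cuda"/"cpu").
class DeviceModelQueue:
    def __init__(self):
        self._models = OrderedDict()  # insertion order doubles as recency order

    def touch(self, model_name: str, model) -> None:
        """Record that a model was just loaded or used for vectorising."""
        self._models.pop(model_name, None)  # move to the most-recent end
        self._models[model_name] = model

    def evict_oldest(self) -> str:
        """Eject the least recently used model to free device memory."""
        model_name, _ = self._models.popitem(last=False)
        return model_name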
@wanliAlex Thanks for the doc.
Feedback:
- Rename mq.loaded_models() to mq.get_loaded_models(), to be consistent with existing methods like get_indexes()
- mq.eject_model() and mq.get_loaded_models() may be all we need, since models are loaded implicitly when they are used
- Reformat available_models before the customer receives it, such that it fits the proposed structure, meaning we don't need to rewrite the dictionary

Next steps:
- Work out what is needed for the get_loaded_models() method. For example, where would the description of each model come from? Can we get the model properties too (such as dimensions and type)?
- Check what changes are needed in tensor_search and s2_inference for this API to be implemented.

Updated design:
Neural network models are used in the Server to vectorise the inputs into tensors. Models that are called by the vectorise function are loaded onto devices (cuda or cpu).
Currently, we do not have a method to manage the loaded models. Specifically, end users do not have a way to
1) know which models are loaded,
2) eject a loaded model.
This design proposes to add these features. We split the design into two parts: client and server.
For the client, the API would look like this:
from marqo import Client

mq = Client()
# eject a model from a specific device
mq.eject_model(model_name="ViT-L/14", device="cuda")
# get information about the loaded models on a specific device
# if the device is not given, we return the models loaded on "cuda" and "cpu"
mq.get_loaded_models(device="cuda")
The expected result of mq.get_loaded_models() should be a dictionary:
{"cuda":
    {model_name: model_info,
     # an example
     "ViT-L/14": "a multi-modal model from OpenAI that can encode both text and images"
    },
 "cpu":
    {model_name: model_info,
     # an example
     "onnx/ViT-L/14": "an onnx version of ViT-L/14 that has faster inference speed"
    },
}
As for mq.eject_model, the given model will be ejected from the given device, and mq.get_loaded_models will be called afterwards to confirm the result.
When mq.add_documents() is called, there is a possibility that the model cannot be loaded onto the specified device due to an out-of-memory issue.
Note: We are not going to solve the OOM error in this issue!!
We have two ways to handle this problem:
1) throw an error and ask the end user to eject a model before loading this one (current choice);
2) automatically eject a loaded model to free memory.
All the cached models are stored in the dictionary variable available_models in the module marqo.s2_inference.s2_inference. When mq.get_loaded_models() is called, we need to return this dictionary, reformatted to produce the expected output in the client.
The detailed model information is originally stored in the module marqo.s2_inference.model_registry, and it can be accessed via the variable MODEL_PROPERTIES in marqo.s2_inference.s2_inference. The information for each model is stored as a dictionary entry like this:
'open_clip/RN50/openai':
{'name': 'open_clip/RN50/openai',
'dimensions': 1024,
'note': 'clip model from open_clip implementation',
'type': 'open_clip',
'pretrained': 'openai'},
In addition to this, we may also need to provide the device ("cpu" or "cuda") on which the model is cached. We can read this information from each model's model.device attribute.
As described, all this information is already stored in Marqo. I don't think we need to update or add any function in tensor_search or s2_inference to access it.
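A minimal sketch of how these pieces could be combined when serving get_loaded_models(). The shape of available_models, the model attributes, and the MODEL_PROPERTIES lookup are assumptions here, not Marqo's actual structures:

def build_loaded_models_response(available_models: dict, model_properties: dict) -> dict:
    # Sketch only: assumes available_models maps a cache key to the loaded
    # model object, and that each model exposes model_name and device.
    response = {"cuda": {}, "cpu": {}}
    for model in available_models.values():
        name = model.model_name       # assumed attribute
        device = str(model.device)    # "cpu" or "cuda", as described above
        registry_entry = model_properties.get("models", {}).get(name, {})
        # reuse the registry "note" field as the human-readable model_info
        response[device][name] = registry_entry.get("note", "")
    return response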
I need to go through the HTTP request parsing part to check the required functions here.
Thanks for the update @wanliAlex!
A detail you may need to be aware of:
In this PR, we add 3 new APIs:
- get_loaded_models(): get the loaded models on both "cpu" and "cuda"
- eject_model(): eject a model from a specific device
- get_cuda_info(): get the memory used on device="cuda"
The endpoint results are shown below:
# After we start Marqo, we have the pre-loaded models as:
curl -XGET http://localhost:8882/models
{"results":[{"model_name":"hf/all_datasets_v4_MiniLM-L6","device":"cpu"},{"model_name":"hf/all_datasets_v4_MiniLM-L6","device":"cuda"},{"model_name":"ViT-L/14","device":"cpu"},{"model_name":"ViT-L/14","device":"cuda"}]}
# We can check the cuda usage as:
curl -XGET http://localhost:8882/models/cuda
{"results":"You are using 1.7|14.6GiB on device=cuda"}
# We can eject a model from "cuda" by:
curl -X DELETE 'http://localhost:8882/models?model_name=ViT-L/14&model_device=cuda'
{"message":"eject model success, eject model_name = ViT-L/14 from device = cuda "}
# We can check the loaded models and cuda usage again:
curl -XGET http://localhost:8882/models/cuda
{"results":"You are using 0.1|14.6GB on device=cuda"}
curl -XGET http://localhost:8882/models
{"results":[{"model_name":"hf/all_datasets_v4_MiniLM-L6","device":"cpu"},{"model_name":"hf/all_datasets_v4_MiniLM-L6","device":"cuda"},{"model_name":"ViT-L/14","device":"cpu"}]}
# We can also eject a model from CPU by:
curl -X DELETE 'http://localhost:8882/models?model_name=hf/all_datasets_v4_MiniLM-L6&model_device=cpu'
{"message":"eject model success, eject model_name = hf/all_datasets_v4_MiniLM-L6 from device = cpu "}
curl -XGET http://localhost:8882/models
{"results":[{"model_name":"hf/all_datasets_v4_MiniLM-L6","device":"cuda"},{"model_name":"ViT-L/14","device":"cpu"}]}
# If we try to eject a model that is not loaded on the given device (e.g. the one we just ejected), we will get a 404 error with the returned message:
curl -X DELETE 'http://localhost:8882/models?model_name=ViT-L/14&model_device=cuda'
{"message":"The model_name=ViT-L/14 device=cuda is not loaded","code":"index_not_found","type":"invalid_request","link":null}
INFO: 127.0.0.1:42288 - "DELETE /models?model_name=ViT-L/14&model_device=cuda HTTP/1.1" 404 Not Found
Can I get your opinions on this design? @pandu-k @Jeadie @jn2clark @tomhamer
Thanks @wanliAlex! One piece of feedback: for the eject endpoint, can you give the result of the operation like this:
result: successful
Also, to be consistent with other API calls, the message would be better like this:
successfully ejected model `hf/all_datasets_v4_MiniLM-L6` from device `cpu`
Another piece of feedback: the DELETE endpoint is giving code index_not_found if the model isn't found. Perhaps create a new error (based on the IndexNotFound error) like ModelNotFound.
Also, for curl -XGET http://localhost:8882/models, can we rename results to models:
{"models":[{"model_name":"hf/all_datasets_v4_MiniLM-L6", ...
Device info may be better off with its own endpoint. Also, make the fields as programmatically usable as possible:
curl -XGET http://localhost:8882/device/cuda
{"memory_usage":"1.7 GiB", "total_device_memory": "14.6 GiB"}
> Another piece of feedback: the DELETE endpoint is giving code index_not_found if the model isn't found. Perhaps create a new error (based on the IndexNotFound error) like ModelNotFound.
Thanks for the feedback. I may also create an s2_inference error named ModelNotLoaded.
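A minimal sketch of what the new errors could look like, modelled on the index_not_found response shown earlier. The base-class name and its fields are assumptions about Marqo's error hierarchy, not its actual definitions:

# Sketch only: the base class and its fields are assumptions.
class MarqoWebError(Exception):
    status_code: int = 500
    error_type: str = "invalid_request"
    code: str = "unhandled_error"

    def __init__(self, message: str):
        self.message = message
        super().__init__(message)

class ModelNotFoundError(MarqoWebError):
    """Returned with a 404 when the requested model is not in the registry."""
    status_code = 404
    code = "model_not_found"

class ModelNotLoadedError(MarqoWebError):
    """s2_inference-side error: the model exists but is not loaded on the device."""
    status_code = 404
    code = "model_not_loaded"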
Implemented in https://github.com/marqo-ai/marqo/pull/239
Use cases to implement:
Edge cases: