opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g., k-means and linear regression, to help developers build ML-related features within OpenSearch.

[FEATURE] Proposal - Deep Learning Model Uploading and Inference #302

Open jngz-es opened 2 years ago

jngz-es commented 2 years ago

What / Why

What are we building?

The ML-Commons plugin allows users to upload deep learning models and use them to run inference.

Why are we building it?

In many domains, such as NLP and computer vision, deep learning algorithms outperform traditional machine learning algorithms. For OpenSearch customers who are planning to use machine learning, or who already use traditional machine learning algorithms in their business, deep learning is a good way to improve their systems.

How do we know?

What matters most to OpenSearch?

Bring deep learning models into OpenSearch. Deep learning is increasingly affecting our lives. More and more customers want to use deep learning technology to improve their systems. This feature brings an opportunity for customers to apply deep learning models in OpenSearch for their business.

What does the customer experience look like?

  1. Train the model through a deep learning framework, like PyTorch.
  2. Export the model to TorchScript.
  3. Upload the model through the ML-Commons plugin API.
  4. Call the predict API provided by the ML-Commons plugin to get results (a sketch of steps 1 and 2 follows this list).
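
For illustration, steps 1 and 2 might look like the following in PyTorch (a minimal sketch; the model class is a hypothetical placeholder):

```python
# Minimal sketch of steps 1-2: define a model in PyTorch, then export it
# to TorchScript. SimpleModel is a hypothetical placeholder.
import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, x):
        return self.linear(x)

model = SimpleModel()
model.eval()

# torch.jit.trace records the operations executed on an example input
# and produces a TorchScript module that can run without Python.
example_input = torch.rand(1, 4)
scripted = torch.jit.trace(model, example_input)
scripted.save("model.pt")  # this file is what gets uploaded in step 3
```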

High-level user stories:

What are the risks and assumptions?

The format of model

There are many deep learning frameworks for training models, such as PyTorch, TensorFlow, and MXNet, but their model formats are not compatible with each other. We could start with TorchScript and ONNX, as PyTorch is a very popular deep learning framework and ONNX is a standard format across various deep learning frameworks.
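
For comparison with the TorchScript export above, exporting a PyTorch model to ONNX is similarly small (a sketch; the single-layer model is hypothetical):

```python
# Sketch: exporting a PyTorch model to ONNX, the other candidate format.
# The model here is a hypothetical single linear layer.
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
model.eval()

example_input = torch.rand(1, 4)
torch.onnx.export(
    model,
    example_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
)
```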

The computation cost

We need a limit on model size. Whether to support distributed inference also needs to be considered.

Dedicated ML node with GPU

If there is no dedicated node, we should place stricter resource limits on the models.

ARM support

Support platforms like AWS Graviton.

Craigacp commented 2 years ago

You could look at ONNX Runtime for deploying ONNX models. It has a Java API and is used in production in Java at a number of companies (and also in other search engines). It's designed specifically for inference, and is frequently faster than PyTorch. There are a number of configuration options for constraining the execution environment it runs in, and it can exploit GPUs if they are available. We have a wrapper around it in Tribuo, so you could maintain uniformity in the interfaces if that's helpful, though going through Tribuo's interfaces would add some overhead for dense data vs using ORT directly from Java.

(Full disclosure: I wrote the initial version of the ONNX Runtime Java API and now maintain it in the ONNX Runtime project)
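
For readers who have not used ONNX Runtime, a minimal inference sketch using its Python bindings (the Java API discussed above follows the same environment/session/run flow; `model.onnx` is a placeholder file):

```python
# Minimal ONNX Runtime inference sketch (Python bindings; the Java API
# exposes the same flow via OrtEnvironment and OrtSession).
import numpy as np
import onnxruntime as ort

# Restrict execution to CPU; "CUDAExecutionProvider" could be listed
# first to prefer a GPU when one is available.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
batch = np.random.rand(1, 4).astype(np.float32)

# run(None, ...) returns all model outputs as a list of numpy arrays.
outputs = session.run(None, {input_name: batch})
print(outputs[0].shape)
```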

jngz-es commented 2 years ago

Thanks, Adam, for your comments. You are right; actually we already put ONNX Runtime on our plan and we are still evaluating it.

Craigacp commented 2 years ago

Ok. Let me know if there are issues with the way the Java bits work, we can fix them (though ONNX Runtime is on a quarterly release cycle so new features will have to wait till the next release). For example we've not added support for writing outputs to preallocated user buffers yet and that may be relevant for high throughput applications (on CPUs anyway, there's an unavoidable copy to get any GPU results back).

elfisher commented 2 years ago

@jngz-es thanks for putting this proposal together. I'm very excited about this work. I have some questions/comments:

  1. Are we planning to extend the REST API to support uploading without cluster restarts?
  2. What are the different methods we want to expose?
  3. If we are doing the REST API, I'd suggest we explore building a UI in dashboards to facilitate the upload/management.
  4. How does this fit in with something like using Hugging Face's PyTorch Transformers?

@kgcreative would love your thoughts ^^^

ylwu-amzn commented 2 years ago

Thanks for your comments @elfisher . Add my thoughts here.

  1. Are we planning to extend the REST API to support uploading without cluster restarts?
  2. What are the different methods we want to expose?

Yes. We plan to build several new APIs to support custom models (a hypothetical REST sketch follows the list below):

  1. Users can call the upload model API to upload their own model to OpenSearch (the model will be split into smaller chunks and saved to the ML model index).
  2. Then call the load model API to load the model into memory.
  3. Once the model is loaded, they can call the inference API.
  4. If the model is no longer needed, users can call the unload model API to remove it from memory.
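
A hypothetical end-to-end sketch of that lifecycle over REST, using Python's requests library (endpoint paths are illustrative, not a committed API):

```python
# Hypothetical walkthrough of the proposed model lifecycle; endpoint
# paths mirror the ML-Commons REST style but are illustrative only.
import requests

BASE = "http://localhost:9200/_plugins/_ml"

# 1. Upload: the plugin splits the model file into chunks and stores
#    them in the ML model index. Returns a task to poll.
upload_task = requests.post(f"{BASE}/models/_upload", json={
    "name": "my-model",
    "version": "1.0.0",
    "model_format": "TORCH_SCRIPT",
    "url": "https://example.com/model.zip",  # placeholder model location
}).json()

# Assume polling the task yielded a model id (details omitted here).
model_id = "<model_id>"

# 2. Load the model chunks from the index into memory on ML nodes.
requests.post(f"{BASE}/models/{model_id}/_load")

# 3. Run inference once loading completes.
prediction = requests.post(
    f"{BASE}/_predict/text_embedding/{model_id}",
    json={"text_docs": ["hello world"]},
).json()

# 4. Unload the model from memory when it is no longer needed.
requests.post(f"{BASE}/models/{model_id}/_unload")
```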
  3. If we are doing the REST API, I'd suggest we explore building a UI in dashboards to facilitate the upload/management.

That's a good suggestion, and we do plan to build some UI to support model management/uploading. But considering our resources, we may plan tasks in phases. How about we prioritize the REST API first, then the UI, if there are not enough resources to work in parallel?

  4. How does this fit in with something like using Hugging Face's PyTorch Transformers?

Good point. Hugging Face models are under consideration. If we don't have enough resources, we plan to support some Hugging Face models first, such as NLP models, and we can extend the scope to support more models later. What do you think?

kgcreative commented 2 years ago

@jngz-es - thanks for putting this together!

Some questions for me as well:

  1. Model CRUD operations could definitely benefit from having a management UI to make it easier to manage, load and operate models. Have we thought about building an admin/ML interface in Dashboards?
  2. Do we foresee users wanting to load/unload models after some condition is met? (number of runs, time after execution, period of inactivity, schedule?)
asfoorial commented 2 years ago

Is there a reason for this to be implemented in Java?

Why not dedicate a node for inference and have it implemented in Python? It can be an internal node, so users call the OpenSearch APIs through master nodes, which delegate requests to inference nodes just like search requests are delegated to data nodes. Having a Python inference node would broaden the spectrum of ML/DL capabilities that can be included.

elfisher commented 2 years ago

I saw this is moved to 2.4 on the roadmap. I'm updating the label accordingly and adding the roadmap label

alexahorgan commented 2 years ago

Feature Review check-in notes: The team will work with the docs team to create the corresponding doc issue, and document the type of model that will be supported. Additionally, feedback was given to enable GPU acceleration for deep learning. The UX action item is to work on user-flow requirements in parallel while the API work is occurring.

diegodorgam commented 2 years ago

Just to check, have you guys looked into the possibility of using something like https://www.deepdetect.com/? It has a few tutorials on how to integrate DeepDetect outputs directly into Elasticsearch.

ylwu-amzn commented 2 years ago

Just to check, have you guys looked into the possibility of using something like https://www.deepdetect.com/? It has a few tutorials on how to integrate DeepDetect outputs directly into Elasticsearch.

@diegodorgam Not familiar with DeepDetect. Can you share the tutorial link? If possible, it would be best if you could summarize how DeepDetect integrates with Elasticsearch.

ylwu-amzn commented 2 years ago

Is there a reason for this to be implemented in Java?

Why not dedicate a node for inference and have it implemented in Python? It can be an internal node, so users call the OpenSearch APIs through master nodes, which delegate requests to inference nodes just like search requests are delegated to data nodes. Having a Python inference node would broaden the spectrum of ML/DL capabilities that can be included.

@asfoorial Very sorry that I missed your comment; good suggestion. We thought about this option, but we find it challenging to package all the Python dependencies and build a robust/safe/scalable solution for all platforms. It may not be hard to do such customization manually for one user, but it does not seem easy to build a solution that users can just install and run, like OpenSearch does now. Any detailed solution for this option is welcome.

asfoorial commented 2 years ago

@ylwu-amzn have you considered embedding Python execution within the JVM using https://github.com/ninia/jep?

ylwu-amzn commented 2 years ago

@ylwu-amzn have you considered embedding Python execution within the JVM using https://github.com/ninia/jep?

Yes, @jngz-es did some research on jep before; it has compatibility issues with some CPython extensions, which can crash the JVM. So we didn't choose jep.

jngz-es commented 2 years ago

Yes, just like @ylwu-amzn mentioned, we did some testing on jep. On our side, it introduced some availability issues.

ashim-mahara commented 2 years ago

I think the focus should be directed towards universal runtimes like ONNX, which support a wide range of libraries such as TensorFlow and PyTorch. This also enables indirect support for derivative libraries like HuggingFace Transformers without explicitly writing code for them.

With that being said (sorry, I am out of the loop), are there any discussions or progress on the feature engineering aspect of machine learning? Since ML models expect data in a certain format, there should be built-in transformations (tokenization, tokens to embeddings, standardization, custom functions?) that prepare data for the selected model. A case can be made here for custom integration of select libraries, such as HuggingFace Transformers, that provide "processors" or "tokenizers" to help transform input into the desired format.

I think I am reiterating some of these from the RFC in #123 (changed my github handle, same guy).

ylwu-amzn commented 2 years ago

hi, @ashim-mahara , thanks a lot for your suggestion. We have released the model serving framework as an experimental feature in 2.4; check this doc https://opensearch.org/docs/latest/ml-commons-plugin/model-serving-framework/

In 2.4, we support text embedding models and the standard Hugging Face tokenizer. You can find an example model in our code (link). If you unzip the example model, you will find it contains the model TorchScript file and a tokenizer.json file. The tokenizer.json file is from Hugging Face and defines the tokenization logic.

Will the tokenizer.json file meet your requirements? Feel free to share your use cases and add your suggestions here.
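
As an aside, the bundled tokenizer.json is a standard Hugging Face tokenizers file, so it can be inspected outside OpenSearch (a minimal sketch, assuming the file has been extracted from the model zip):

```python
# Sketch: the tokenizer.json bundled with the example model is a standard
# Hugging Face tokenizers file and can be loaded directly in Python.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

encoding = tokenizer.encode("OpenSearch model serving framework")
print(encoding.tokens)  # subword tokens fed to the embedding model
print(encoding.ids)     # vocabulary ids for the model input
```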

hamzashabbir11 commented 1 year ago

Can the ML-Commons plugin be used for getting better search results, or only for other tasks?

asfoorial commented 1 year ago

@hamzashabbir11 you can use the neural-search plugin, which internally uses ML Common models, to get better search results

hamzashabbir11 commented 1 year ago

Okay, thanks for your answer @asfoorial . I am a beginner in OpenSearch. I am using it to get better search results for an e-commerce app. How can I implement neural search, and are there any resources you can share?

asfoorial commented 1 year ago

@hamzashabbir11 it is not very straightforward, but you can start here https://opensearch.org/docs/latest/neural-search-plugin/index/

ylwu-amzn commented 1 year ago

hi, all, thanks for your valuable feedback and suggestions.

From 2.5, ml-commons supports running models on GPU; refer to this doc https://opensearch.org/docs/latest/ml-commons-plugin/gpu-acceleration/ or https://github.com/opensearch-project/ml-commons/blob/2.x/docs/model_serving_framework/GPU_support.md

From 2.5, we support ONNX models too. You can find text embedding examples (as of 2.5, we only support text embedding models) in this doc https://github.com/opensearch-project/ml-commons/blob/2.x/docs/model_serving_framework/text_embedding_model_examples.md

After a model is loaded and running in the model serving framework, you can use the neural-search plugin to do semantic search (a query sketch follows); refer to https://opensearch.org/docs/latest/neural-search-plugin/index/
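
A hypothetical semantic search query against the neural-search plugin (the index name, embedding field name, and model id are placeholders):

```python
# Hypothetical semantic search via the neural-search plugin: the "neural"
# query embeds query_text with the loaded model and runs k-NN retrieval.
import requests

query = {
    "query": {
        "neural": {
            "passage_embedding": {  # placeholder embedding field name
                "query_text": "wireless headphones with noise cancelling",
                "model_id": "<model_id>",
                "k": 10,
            }
        }
    }
}

resp = requests.get("http://localhost:9200/my-index/_search", json=query)
print(resp.json()["hits"]["hits"])
```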

Note: the model serving framework and neural search are still experimental. You are welcome to try them and provide your feedback. Let's build a better product together!

YuanBoXie commented 1 year ago

It seems that a future version of OpenSearch plans to provide MLOps functionality. Is that right? CRUD for models has now been implemented. However, there are still many problems in model management:

ylwu-amzn commented 1 year ago

@hexbo Thanks for your suggestions, really great points!

We do have a plan to support MLOps. Is it something valuable to you? We would appreciate it if you could share more details and suggestions, such as what functionality you need and what pain points you have now.

Regarding the suggestion to support more model formats and inputs: yes, we will support more. Do you have a priority list for these, i.e., which model format/input type should we support first for your use case?

YuanBoXie commented 1 year ago

@ylwu-amzn Yeah, I always use TensorFlow to develop DL models. At present, TensorFlow's market share is also high, so I hope OpenSearch can support TensorFlow model files in an upcoming version.

ylwu-amzn commented 1 year ago

Thanks @hexbo for sharing this. BTW, can you export your TensorFlow model to ONNX (a conversion sketch follows)? From the 2.5 release we support ONNX, though only text embedding models for now (as of 2.5). Do you have requirements for other model types? If so, we would like to learn how you are going to use them in OpenSearch.
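
For reference, a minimal tf2onnx conversion sketch for a Keras model (the two-layer model is a hypothetical stand-in):

```python
# Sketch: converting a TensorFlow/Keras model to ONNX with tf2onnx.
# Models with custom ops are where automatic conversion tends to fail,
# as noted in the reply below.
import tensorflow as tf
import tf2onnx

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(2),
])

spec = (tf.TensorSpec((None, 4), tf.float32, name="input"),)
model_proto, _ = tf2onnx.convert.from_keras(
    model, input_signature=spec, opset=13, output_path="model.onnx"
)
```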

YuanBoXie commented 1 year ago

Yeah, tf2onnx can do this, but it can't automatically convert my model.