manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0
8.81k stars 487 forks

POC: integrate Rust ML library with Manticore Search C++ code #2074

Open sanikolaev opened 4 months ago

sanikolaev commented 4 months ago

As discussed in the dev call on April 18, 2024, we'd like to integrate an ML library written in Rust with our Manticore Search code. Before proceeding, we'd like to experiment with using a Rust library in C++ code in principle. This task is to conduct the experiment.


Checklist

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

- [x] Task estimated
- [x] Bug reproduced
- [x] Specification created, reviewed and approved
- [ ] Implementation completed
- [x] Tests developed
- [x] Documentation updated
- [x] Documentation proofread
- [x] Changelog updated
- [x] OpenAPI YAML updated and issue created to rebuild clients

AbstractiveNord commented 4 months ago

Any help required?

donhardman commented 4 months ago

In short, we want to build a Rust library that can be used from the Manticore daemon source code. This library should have one simple function - converting text to a vector. As a proof of concept, it can be a basic implementation that exposes just one function. This function should accept a list of characters or some native C++ data type and return a vector of floats. Everything can be hardcoded for now.

The base code and libraries to use can be taken from this PHP extension: https://github.com/manticoresoftware/php-ext-model

If you can help with a simple implementation to validate this concept and benchmark it with the Manticore daemon, that would be great. We appreciate your assistance.

AbstractiveNord commented 4 months ago

Just to make this as clear as possible: you want a DLL that exposes one function. What is the text argument of that function? Is the input text a std::string or not?

donhardman commented 4 months ago

Absolutely, a shared library (a DLL, or a .so on Linux). This library should be usable from C++ code. The interface can accept a std::string or just a pointer to an array of chars, whichever is more efficient.

AbstractiveNord commented 4 months ago

OK, so std::string in, a vector of 32-bit floats out. I'm not sure about dynamic linking, since all my experiments used static linking.

tomatolog commented 4 months ago

It would be better to pass const char * plus a length, or a string_view, since we don't use any std containers, and passing std::string means an allocation and a copy from the plain const char *.
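
For illustration, here is a minimal sketch of how the Rust side could accept a pointer plus a length and borrow the bytes without copying them. The names `borrow_text` and `poc_count_tokens` are hypothetical and are not taken from any Manticore code; the token counting is only a placeholder for real work. Because the length is explicit, the input needs no NUL terminator, and invalid UTF-8 is reported as an error instead of being trusted.

```rust
use std::os::raw::c_char;
use std::slice;
use std::str;

/// Borrow `len` bytes starting at `ptr` as a `&str` without copying.
/// Returns `None` for a null pointer or for bytes that are not valid UTF-8.
///
/// # Safety
/// `ptr` must point to at least `len` readable bytes that stay alive and
/// unmodified for as long as the returned reference is used.
pub unsafe fn borrow_text<'a>(ptr: *const c_char, len: usize) -> Option<&'a str> {
    if ptr.is_null() {
        return None;
    }
    // No allocation: the slice is only a view over the caller's buffer.
    let bytes = unsafe { slice::from_raw_parts(ptr as *const u8, len) };
    str::from_utf8(bytes).ok()
}

/// Hypothetical exported entry point (assumption, not an existing API).
/// Returns the number of whitespace-separated tokens, or -1 on bad input.
///
/// # Safety
/// Same requirements as `borrow_text`.
#[no_mangle]
pub unsafe extern "C" fn poc_count_tokens(ptr: *const c_char, len: usize) -> i64 {
    match unsafe { borrow_text(ptr, len) } {
        Some(text) => text.split_whitespace().count() as i64,
        None => -1,
    }
}
```

On the C++ side, a `std::string_view` maps onto this signature directly through its `data()` and `size()` members, so no `std::string` has to be constructed.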

AbstractiveNord commented 4 months ago

Are your strings not null-terminated? Are they not guaranteed to be valid UTF-8?

donhardman commented 3 months ago

Hey @AbstractiveNord, thanks a bunch for this pull request! We actually already built a separate library, and your contribution will definitely help us streamline the integration process.

Here's the library we're looking to use in Manticore: https://github.com/manticoresoftware/manticoresearch-text-embeddings

There are a few limitations we'd like to address:

Take a look at the examples folder – it has some samples demonstrating how to use C to call the library, along with benchmarks. The headers also include build instructions.

Building the library is simple: just install Rust and use cargo:

cargo build --lib --release

The dynamic library will be located in the target/release folder after building.

Regarding the library itself, we're seeing almost no overhead in terms of time, but this still needs further validation. I've added some tests for this purpose.

Trying to integrate it now sounds like a solid plan. Let's also do some profiling to get a better grasp of the overhead, not just in terms of time but also memory usage. I tried sticking with native code as much as possible, but since we're using an external library, we need to convert the input into that library's internal types. Otherwise, we'd have to rebuild the external libraries, which isn't ideal. The good news is that benchmarking shows minimal overhead.

So, we end up returning a *const f32, which is a native pointer. Keep in mind that Rust handles memory differently, so if we forget about this pointer, it's on the C side to clean things up using free. Otherwise, we might run into memory leaks.
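
Releasing Rust-allocated memory with C's free relies on both sides using the same allocator. One alternative that avoids that assumption is to export a matching release function from the Rust library, so the buffer goes back to the allocator that created it. The sketch below shows that pairing; `POC_DIM`, `poc_make_embedding` and `poc_free_embedding` are hypothetical names, the values are hardcoded, and this is not how the linked library is known to work.

```rust
/// Length of the toy embedding (assumption; real models use far more dimensions).
pub const POC_DIM: usize = 4;

/// Hypothetical producer: returns a heap buffer of `POC_DIM` floats.
/// Ownership moves to the caller, who must hand the pointer back to
/// `poc_free_embedding` exactly once.
#[no_mangle]
pub extern "C" fn poc_make_embedding(seed: f32) -> *mut f32 {
    let values: Box<[f32]> = vec![seed; POC_DIM].into_boxed_slice();
    // `into_raw` makes Rust give up ownership without freeing the buffer.
    Box::into_raw(values) as *mut f32
}

/// Hypothetical release function: rebuilds the box so Rust frees it.
///
/// # Safety
/// `ptr` must be null or a pointer previously returned by
/// `poc_make_embedding` that has not been freed yet.
#[no_mangle]
pub unsafe extern "C" fn poc_free_embedding(ptr: *mut f32) {
    if ptr.is_null() {
        return;
    }
    let slice = std::ptr::slice_from_raw_parts_mut(ptr, POC_DIM);
    drop(unsafe { Box::from_raw(slice) });
}
```

The C or C++ caller then treats the pointer as read-only data and calls the release function when done, instead of calling free on it.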

donhardman commented 3 months ago

As discussed before, here are the points for consideration:

  1. Memory Allocation: Allocate the vector in C and pass it as a parameter to Rust.
  2. Model Initialization: Implement an initialization method to download the model (currently using lazy loading). This method should accept a path as an optional parameter.
  3. Interface Update: Modify the interface to utilize ModelPtr as a void pointer instead of the TextEmbeddings wrapper.
  4. Memory Management: Introduce methods to call the black-boxed model and subsequently free up all memory it uses.

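
The four points above could fit together roughly as in the following sketch. Everything here is an assumption made for illustration: `ToyModel`, `poc_model_init`, `poc_model_embed` and `poc_model_free` are hypothetical names, and the "embedding" just sums byte values into buckets where real inference would run.

```rust
use std::ffi::{c_void, CStr};
use std::os::raw::c_char;
use std::slice;

/// Stand-in for a real model (hypothetical).
struct ToyModel {
    dim: usize,
    /// Kept only to show where an optional model path would be stored.
    #[allow(dead_code)]
    path: Option<String>,
}

/// Points 2 and 3: explicit init returning an opaque handle.
/// `path` may be null. Returns null if `dim` is zero.
///
/// # Safety
/// `path` must be null or a valid NUL-terminated string.
#[no_mangle]
pub unsafe extern "C" fn poc_model_init(path: *const c_char, dim: usize) -> *mut c_void {
    if dim == 0 {
        return std::ptr::null_mut();
    }
    let path = if path.is_null() {
        None
    } else {
        Some(unsafe { CStr::from_ptr(path) }.to_string_lossy().into_owned())
    };
    Box::into_raw(Box::new(ToyModel { dim, path })) as *mut c_void
}

/// Points 1 and 4: fill a caller-allocated buffer of `out_len` floats.
/// Returns the number of floats written, or -1 on bad arguments.
///
/// # Safety
/// `model` must come from `poc_model_init`; `text` must point to `text_len`
/// readable bytes; `out` must point to `out_len` writable floats.
#[no_mangle]
pub unsafe extern "C" fn poc_model_embed(
    model: *const c_void,
    text: *const c_char,
    text_len: usize,
    out: *mut f32,
    out_len: usize,
) -> i64 {
    if model.is_null() || text.is_null() || out.is_null() {
        return -1;
    }
    let model = unsafe { &*(model as *const ToyModel) };
    if out_len < model.dim {
        return -1;
    }
    let bytes = unsafe { slice::from_raw_parts(text as *const u8, text_len) };
    let out = unsafe { slice::from_raw_parts_mut(out, model.dim) };
    for v in out.iter_mut() {
        *v = 0.0;
    }
    // Toy computation: add each byte value to the bucket for its position.
    for (i, b) in bytes.iter().enumerate() {
        out[i % model.dim] += *b as f32;
    }
    model.dim as i64
}

/// Point 4: release everything the handle owns.
///
/// # Safety
/// `model` must be null or an unfreed handle from `poc_model_init`.
#[no_mangle]
pub unsafe extern "C" fn poc_model_free(model: *mut c_void) {
    if !model.is_null() {
        drop(unsafe { Box::from_raw(model as *mut ToyModel) });
    }
}
```

With this shape the output vector never crosses the allocator boundary: the caller owns the float buffer, and Rust owns only the model behind the handle.
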
sanikolaev commented 3 months ago

To sum up, what @donhardman has done proves that the Rust library can work from C, but it doesn't have anything to do with Manticore yet. As discussed on yesterday's call, the next step is to implement some DEBUG embedding command in Manticore Search to prove the concept to the point where we know it works with Manticore. What @AbstractiveNord has done here https://github.com/manticoresoftware/manticoresearch/pull/1148/files may be helpful.

tomatolog commented 4 weeks ago

We could try using llama.cpp to skip the Rust marshaling. It supports a huge list of different models, and here is an example of embedding in C++