sanikolaev closed this 1 month ago
Any help required?
In short, we want to build a Rust library that can be used from the Manticore daemon source code. This library should have one simple function - converting text to a vector. As a proof of concept, it can be a basic implementation that exposes just one function. This function should accept a list of characters or some native C++ data type and return a vector of floats. Everything can be hardcoded for now.
The base code and libraries to use can be taken from this PHP extension: https://github.com/manticoresoftware/php-ext-model
If you can help with a simple implementation to validate this concept and benchmark it with the Manticore daemon, that would be great. We appreciate your assistance.
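A minimal sketch of what such a POC export could look like on the Rust side, with everything hardcoded as suggested above. The function name, buffer-based signature, and dummy values are all hypothetical, not from the actual library; in a real build this would live in a cdylib crate so the daemon can load it.

```rust
// Hypothetical POC: one C-callable function that takes a NUL-terminated
// C string and writes a hardcoded "embedding" into a caller-provided buffer.
use std::os::raw::{c_char, c_float};

/// Fills `out` (capacity `out_len`) with dummy embedding values for `text`.
/// Returns the number of floats actually written.
#[no_mangle]
pub extern "C" fn text_to_vector(
    _text: *const c_char,
    out: *mut c_float,
    out_len: usize,
) -> usize {
    // Hardcoded placeholder values, as the issue suggests for the POC.
    let dummy = [0.1f32, 0.2, 0.3, 0.4];
    let dims = out_len.min(dummy.len());
    unsafe {
        for i in 0..dims {
            *out.add(i) = dummy[i];
        }
    }
    dims
}

fn main() {
    // Exercise the exported function from Rust itself as a smoke test.
    let text = std::ffi::CString::new("hello").unwrap();
    let mut buf = [0.0f32; 4];
    let n = text_to_vector(text.as_ptr(), buf.as_mut_ptr(), buf.len());
    assert_eq!(n, 4);
    println!("{:?}", &buf[..n]);
}
```

With `crate-type = ["cdylib"]` in Cargo.toml, the same function would be callable from the daemon's C++ code through a matching extern "C" declaration.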
Just want to make this as clear as possible. You want a DLL which exposes one function. What is the text argument of that function? Is the input text a std::string or not?
Absolutely, a shared library (a DLL, or .so on Linux). This library should be usable from C++ code. The interface can accept std::string or just a pointer to a list of chars, whichever is the most efficient way possible.
OK, then std::string into a vector of 32-bit floats. I am not sure about dynamic linking, since all my experiments used static linking.
It could be better to pass const char * and a length, or a string_view, as we do not use any std containers, and passing std::string means an allocation and a copy from a plain const char *.
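On the Rust side, the pointer-plus-length interface suggested above maps naturally onto a slice, so no std::string allocation or copy is needed at the boundary. A small sketch, with a hypothetical function name and a placeholder return value instead of a real embedding:

```rust
// Sketch: accept a raw byte pointer and a length instead of std::string.
use std::os::raw::c_char;

#[no_mangle]
pub extern "C" fn embed_text(text: *const c_char, len: usize) -> f32 {
    // Safety: the caller guarantees `text` points to `len` valid bytes.
    let bytes = unsafe { std::slice::from_raw_parts(text as *const u8, len) };
    // No NUL terminator is required, and invalid UTF-8 can be replaced
    // rather than rejected.
    let text = String::from_utf8_lossy(bytes);
    text.len() as f32 // placeholder "embedding" value for the sketch
}

fn main() {
    let s = "пример"; // non-ASCII input; no NUL terminator needed
    let v = embed_text(s.as_ptr() as *const c_char, s.len());
    println!("{}", v);
}
```

This also answers the null-termination and UTF-8 questions below: with an explicit length, termination is irrelevant, and `from_utf8_lossy` tolerates malformed input.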
Are your strings not null-terminated? Are they not valid UTF-8?
Hey @AbstractiveNord, thanks a bunch for this pull request! We actually already built a separate library, and your contribution will definitely help us streamline the integration process.
Here's the library we're looking to use in Manticore: https://github.com/manticoresoftware/manticoresearch-text-embeddings
There are a few limitations we'd like to address:
Take a look at the examples folder – it has some samples demonstrating how to use C to call the library, along with benchmarks. The headers also include build instructions.
Building the library is simple: just install Rust and use cargo:
cargo build --lib --release
The dynamic library will be located in the target/release folder after building.
Regarding the library itself, we're seeing almost no overhead in terms of time, but this still needs further validation. I've added some tests for this purpose.
Trying to integrate it now sounds like a solid plan. Let's also do some profiling to get a better grasp of the overhead, not just in terms of time but also memory usage. I tried sticking with native code as much as possible, but since we're using an external library, we'll need to convert input for internal types. Otherwise, we'd have to rebuild the external libraries, which isn't ideal. The good news is that benchmarking shows minimal overhead.
So, we end up returning a *const f32, which is a raw pointer. Keep in mind that Rust handles memory differently, so if we forget about this pointer, it's on the C side to clean things up using free. Otherwise, we might run into memory leaks.
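One note on the cleanup point above: a common FFI pattern is for the Rust library to export its own free function, so the buffer is released by the same allocator that created it (calling C's free() on Rust-allocated memory is not safe in general). A hypothetical sketch of this ownership handoff, with made-up function names:

```rust
// Rust allocates the embedding, hands a raw pointer to the C/C++ side,
// and exposes a matching deallocation function.

#[no_mangle]
pub extern "C" fn make_embedding(len: usize) -> *mut f32 {
    let v = vec![0.0f32; len].into_boxed_slice();
    // Ownership is transferred to the caller; Rust will not free this.
    Box::into_raw(v) as *mut f32
}

#[no_mangle]
pub extern "C" fn free_embedding(ptr: *mut f32, len: usize) {
    if ptr.is_null() {
        return;
    }
    // Reconstruct the Box so Rust drops (frees) the buffer itself.
    unsafe {
        let slice = std::slice::from_raw_parts_mut(ptr, len);
        drop(Box::from_raw(slice as *mut [f32]));
    }
}

fn main() {
    let p = make_embedding(384);
    assert!(!p.is_null());
    unsafe { *p = 1.5 };
    // The C++ caller would invoke this when done with the vector.
    free_embedding(p, 384);
    println!("allocated and freed 384 floats");
}
```

If the existing library really expects the C side to call free(), that only works when both sides share the same allocator, which is fragile; a Rust-exported free keeps the contract explicit.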
As discussed before, here are the points for consideration:
To sum up, what @donhardman has done proves the Rust library can work in C, but it doesn't have anything to do with Manticore yet. As discussed on yesterday's call, the next step is to implement some DEBUG embedding command in Manticore Search to prove the concept to the point where we understand that it works with Manticore. What @AbstractiveNord has done here https://github.com/manticoresoftware/manticoresearch/pull/1148/files may be helpful.
I've pushed the branch https://github.com/manticoresoftware/manticoresearch-text-embeddings/tree/cpp_bind to the Rust lib repo. I've also pushed the branch 'embeddings' into the Manticore source tree. Both branches should work together.
On the Rust side: 'cargo build --lib' or 'cargo build --lib --release'. On the Manticore side:
mysql> debug load embeddings '/opt/work/manticoresearch-text-embeddings/target/release/libmanticoresearch_text_embeddings.dylib';
+-----------------------+--------+
| command | result |
+-----------------------+--------+
| debug load embeddings | Ok |
+-----------------------+--------+
1 row in set (0,01 sec)
mysql> debug load model 'sentence-transformers/multi-qa-MiniLM-L6-cos-v1';
+------------------+-------------------------------------+
| command | result |
+------------------+-------------------------------------+
| debug load model | hidden_size=384, max_input_size=512 |
+------------------+-------------------------------------+
1 row in set (0,07 sec)
mysql> debug embeddings 'This is a sample text.';
+-----------+-------------+
| embedding | value |
+-----------+-------------+
| 0 | -0.01043452 |
| 1 | 0.06928102 |
| 2 | -0.04854530 |
| 3 | -0.00591148 |
| 4 | 0.03279826 |
| 5 | 0.02965242 |
| 6 | 0.07039508 |
| 7 | 0.00784686 |
...
| 380 | 0.12422077 |
| 381 | 0.10335055 |
| 382 | 0.11339451 |
| 383 | -0.00519720 |
+-----------+-------------+
384 rows in set (0,03 sec)
The command names are arbitrary for the POC; they are just for experimenting. Also, the header file is inlined into the C++ code; see the beginning of 'src/embeddings/embeddings.cpp'. The path to the lib for 'debug load embeddings' should be adjusted for your instance.
@donhardman pls test it and prepare a further plan of action.
Do you plan to optionally utilize the GPU in manticoresearch-text-embeddings? Just curious about the throughput of CPU inference.
It depends on the library we'll be using. So far we are going to use https://github.com/huggingface/candle . They say:
Candle is a minimalist ML framework for Rust with a focus on performance (including GPU support)
so it should be possible to use the GPU.
To move forward, we should discuss and approve the interface for configuring fields that will have auto-embedding.
Currently, I suggest starting with the following specification:
- fields = "field1, field2": fields we should use as the source for text to generate embeddings
- model_name = "openai/..." or model_name = "sentence-transformers/..." for local models
- model_cache_path = "...": path to the cache directory where we will store everything in case of local model usage
- api_key = "...": for cases where we use a remote API for embeddings
- use_gpu = 1|0: for cases where we use a local model and need to use the GPU (if available); the default is always CPU

It may look like this:
CREATE TABLE test (
title TEXT,
image_vector FLOAT_VECTOR KNN_TYPE='hnsw' KNN_DIMS='4' HNSW_SIMILARITY='l2'
MODEL_NAME = "..."
MODEL_CACHE_PATH = "..."
);
I've also refactored the code and updated the interface. It's subject to review and discussion on whether we should proceed with it or not: https://github.com/manticoresoftware/manticoresearch-text-embeddings
Things to consider:
@klirichek please review my changes in the interface and let me know if it's all OK
Maybe, if api_key and model_name are set for a remote model, additional validation is needed at CREATE TABLE time to check that the parameters are good and the remote API accepts them, so that CREATE TABLE fails if the user wants an OpenAI model but the internet is not available or the api_key is wrong.
Maybe such a check is also needed at daemon start, or at daemon restart after a crash, to disable such an index or put it into read-only mode.
Please also note that api_keys are secret and dynamic information; they can change frequently.
As discussed before, I have split this task into multiple ones:
Closing this issue as the Proof of Concept (POC) is done.
As discussed in the dev call on April 18, 2024, we'd like to integrate an ML library written in Rust with our Manticore Search code. Before proceeding, we'd like to experiment with using a Rust library in C++ code in principle. This task is to conduct the experiment.
Checklist
To be completed by the assignee. Check off tasks that have been completed or are not applicable.