manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0
8.67k stars 483 forks source link

Lookup for vector in different index while searching for similar docs. #2351

Open starinacool opened 1 week ago

starinacool commented 1 week ago

Proposal:

In a query like this, it would be nice to specify the source index of vector:


select ... from <table name> where knn ( <field>, <k>, <document id> [, <other table name>.<field>] )

<other table name>.<field> - is the notation of other index and field where to look up a vector for searching in <table name>

In our case documents are spread in different indexes and I want to be able to find similar documents in other index then the source doc index.

Checklist:

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

- [ ] Implementation completed - [ ] Tests developed - [ ] Documentation updated - [ ] Documentation reviewed - [ ] Changelog updated - [x] OpenAPI YAML updated and issue created to rebuild clients
sanikolaev commented 1 week ago

Doesn't joining another table work in your case? https://manual.manticoresearch.com/Searching/Joining

starinacool commented 6 days ago

If joining can lookup a vector from index A and find similar docs in table B then it will work. But I do not see how it could be done with joins.

sanikolaev commented 6 days ago

and find similar docs in table B

similar to what?

starinacool commented 6 days ago

Similar to the vector from index A.

starinacool commented 6 days ago

The idea is to use vector from one index to search in another index.

sanikolaev commented 6 days ago

Here's an example of knn + joined tables + ordering by knn + filtering by string in the joined table:

mysql> drop table if exists test; create table test ( title text, image_vector float_vector knn_type='hnsw' knn_dims='4' hnsw_similarity='l2' ); insert into test values ( 1, 'yellow bag', (0.653448,0.192478,0.017971,0.339821) ), ( 2, 'white bag', (-0.148894,0.748278,0.091892,-0.095406) ); drop table if exists test2; create table test2 (color string); insert into test2 values(1, 'red'), (2, 'green'); select
id, title, knn_dist(), test2.color from test left join test2 on test2.id = test.id where knn ( image_vector, 5, (0.286569,-0.031816,0.066684,0.032926), 2000 ) and test2.color='red';
--------------
drop table if exists test
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table test ( title text, image_vector float_vector knn_type='hnsw' knn_dims='4' hnsw_similarity='l2' )
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into test values ( 1, 'yellow bag', (0.653448,0.192478,0.017971,0.339821) ), ( 2, 'white bag', (-0.148894,0.748278,0.091892,-0.095406) )
--------------

Query OK, 2 rows affected (0.00 sec)

--------------
drop table if exists test2
--------------

Query OK, 0 rows affected (0.03 sec)

--------------
create table test2 (color string)
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into test2 values(1, 'red'), (2, 'green')
--------------

Query OK, 2 rows affected (0.00 sec)

--------------
select id, title, knn_dist(), test2.color from test left join test2 on test2.id = test.id where knn ( image_vector, 5, (0.286569,-0.031816,0.066684,0.032926), 2000 ) and test2.color='red'
--------------

+------+------------+------------+-------------+
| id   | title      | knn_dist() | test2.color |
+------+------------+------------+-------------+
|    1 | yellow bag | 0.28146550 | red         |
+------+------------+------------+-------------+
1 row in set (0.00 sec)
--- 1 out of 1 results in 0ms ---
starinacool commented 6 days ago

I will clarify.

The idea is to use vector from one index to search in another index by that vector.

So, both indexes has same dimension vectors.

create table test1 (id int, title text, vector float_vector knn_type='hnsw' knn_dims='4' )
create table test2 (id int, title text, vector float_vector knn_type='hnsw' knn_dims='4' )

The idea is to have a shortcut for these requests:

select vector as "VECTOR_FROM_1" from test1 where id=1234;
select id, title from test2 where knn ( vector, 5, (VECTOR_FROM_1), 2000 );

So instead of two requests combine it to one

select id, title from test2 where knn ( vector, 5, 1234 , test1.vector )
sanikolaev commented 4 days ago

Thank you! The idea is more clear now. In your original spec you said:

<other table name>.<field> - is the notation of other index and field where to look up a vector for searching in <table name>

How are the <other table name> and <table name> related in your case? Can they be joined by id or smth else?

starinacool commented 4 days ago

No. Two tables consist of different data. The only thing in common is a vector fields with the same dimension.

sanikolaev commented 3 days ago

In your spec

select ... from <table name> where knn ( <field>, <k>, <document id> [, <other table name>.<field>] )

is <document id> meant to be a document id from <other table name> or <table name>?

starinacool commented 1 day ago

id from <other table name>