Fuzzy string matching indexes

igrekun commented 2 years ago

Is your feature request related to a problem? Please describe. Running full fledged Elasticsearch cluster to search short strings seems like an overkill.

Describe the solution you'd like Basic tri-gram or tf-idf / cosine distance for simple fuzzy string matching.

Describe alternatives you've considered Elasticsearch.

Additional context I will more than likely work on implementing those indexes, but Jamie Liu told me it's better to get alignment with the team by filing an issue first.

What are your thoughts on a simple but native text index for Nebula? It can't replace Elasticsearch but should suffice for matching names and similar use cases.

The options under consideration are

GIN / GiST indexes pretty much like Postres implements them
Approximate KNN over TF-IDF embeddings like NMSLIB / FAISS does it.

Sophie-Xie commented 2 years ago

@cangfengzhs @critical27 What do you think about this issue？

jamieliu1023 commented 2 years ago

@igrekun Thanks for bringing this up for discussion! I do think it is a recommended practice to first file an issue to get alignment between the two sides before you actually start working on it, which can avoid waste in time and energy just in case the solution you provide does not match the design of the core team. Hope that makes sense to you.

Please @critical27 @cangfengzhs provide your insight on this issue.

critical27 commented 2 years ago

Thx for contacting us. So you are going to add tri-gram index in nebula instead of es, right?

cangfengzhs commented 2 years ago

These are indeed some very useful features. I have learned some knowledge of NLP, but I don't fully grasp it. And I have the following questions:

If I am a user, and I use string index based on cos distance. But how do I do word2vec? Does nebula have a built-in vocabulary or do I upload a vocabulary myself? Or even further, is it possible to provide fasttext directly as a plug-in?
How to calculate IDF in a constantly changing data set?
I have not learned about GIN / GiST. But in my impression, both nmslib and faiss need a lot of memory, right? Is there a way to solve this problem? The CPU may have the same problem.
How to make persistence? For example, when nmslib performs persistence, it serializes all data together into a binary file. So we need to reduce unnecessary write overhead in the form of LOG and checkpoint. This is just a preliminary idea, and a detailed design is needed.

I am looking forward to the realization of native fuzzy string index. Maybe you can make a simple design first, and then we will discuss its possible problems and try to solve them. Moreover, I am also very willing to participate in the development process of this feature.

cangfengzhs commented 2 years ago

And, I have a bolder idea. The key point of this feature is the vector ANN search. We may be able to support a vector data type and expose it to users. We are only responsible for the search of vectors. After all, the performance of textCNN/ELMo/Bert will be better than tri-gram.

igrekun commented 2 years ago

@cangfengzhs I very much like the bold idea (: Generic vector type that supports cosine / L2 distance to better handle user supplied vectors should cover it.

Fuzzy search is then done by treating each trigram as a word and searching closest by cosine. Anything more fancy should then be "bring your own vectors" not to clutter the core codebase.

Personally I fancy the idea of generic ANN more than GIN / GiST since it is more general. Given the vector type, graph and basic math one could run ML algorithms without reaching for spark.

If we go with ANNs then I almost have a design to build upon, just let me know where to move further technical discussion if we decide to proceed on this!

vesoft-inc / nebula

Fuzzy string matching indexes #3526