Open igrekun opened 2 years ago
@cangfengzhs @critical27 What do you think about this issue?
@igrekun Thanks for bringing this up for discussion! It is recommended practice to file an issue first and get aligned with the core team before you start working, so that no time or energy is wasted if the solution you have in mind does not match the team's design. Hope that makes sense to you.
Please @critical27 @cangfengzhs provide your insight on this issue.
Thanks for contacting us. So you are going to add a tri-gram index in Nebula itself instead of relying on ES, right?
These are indeed some very useful features. I have some knowledge of NLP, but I don't fully grasp it, so I have the following questions:
I am looking forward to a native fuzzy string index. Maybe you can write up a simple design first, and then we can discuss its possible problems and try to solve them. I would also be very willing to participate in developing this feature.
And, I have a bolder idea. The key point of this feature is the vector ANN search. We may be able to support a vector
@cangfengzhs I very much like the bold idea (: A generic vector type that supports cosine / L2 distance, to better handle user-supplied vectors, should cover it.
Fuzzy search is then done by treating each trigram as a word and searching for the closest vectors by cosine. Anything fancier should then be "bring your own vectors", so as not to clutter the core codebase.
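To make the trigram idea concrete, here is a minimal Python sketch, not Nebula code and not a proposed implementation. It assumes strings are lowercased and padded with spaces before splitting into character trigrams, and it ranks candidates by a brute-force scan rather than an ANN index. The names `trigrams`, `cosine`, and `fuzzy_search` are hypothetical.

```python
import math
from collections import Counter


def trigrams(text: str) -> Counter:
    """Count character trigrams; the padding scheme is an assumption."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuzzy_search(query: str, candidates: list[str], top_k: int = 3):
    """Brute-force ranking; a real index would replace this linear scan."""
    q = trigrams(query)
    scored = [(cosine(q, trigrams(c)), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    names = ["John Smith", "Jane Doe", "Smithson"]
    for score, name in fuzzy_search("Jon Smith", names):
        print(f"{score:.3f}  {name}")
```

With a generic vector type, the same `cosine` step would run over stored vectors, and the linear scan is the part an ANN index would speed up.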
Personally I fancy the idea of a generic ANN index more than GIN / GiST since it is more general. Given the vector type, the graph, and basic math, one could run ML algorithms without reaching for Spark.
If we go with ANNs then I almost have a design to build upon, just let me know where to move further technical discussion if we decide to proceed on this!
Is your feature request related to a problem? Please describe. Running a full-fledged Elasticsearch cluster just to search short strings seems like overkill.
Describe the solution you'd like Basic tri-gram or tf-idf / cosine distance for simple fuzzy string matching.
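For illustration only, the tf-idf variant could look roughly like the sketch below. It assumes character trigrams as terms and a common smoothed idf formula, `ln((1 + n) / (1 + df)) + 1`; both choices, along with all function names, are assumptions and not part of any existing Nebula API.

```python
import math
from collections import Counter


def char_trigrams(text: str) -> list[str]:
    """Split into character trigrams; the padding scheme is an assumption."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_idf(corpus: list[str]) -> dict[str, float]:
    """Smoothed inverse document frequency per trigram."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(char_trigrams(doc)))  # count each trigram once per doc
    return {g: math.log((1 + n) / (1 + d)) + 1.0 for g, d in df.items()}


def tfidf_vector(text: str, idf: dict[str, float]) -> dict[str, float]:
    """Sparse tf-idf vector; trigrams unseen in the corpus are dropped."""
    tf = Counter(char_trigrams(text))
    return {g: c * idf[g] for g, c in tf.items() if g in idf}


def cosine_sparse(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    corpus = ["alice corp", "bob corp", "carol corp"]
    idf = build_idf(corpus)
    query = tfidf_vector("alise", idf)
    for doc in corpus:
        print(doc, round(cosine_sparse(query, tfidf_vector(doc, idf)), 3))
```

The idf weighting makes trigrams shared by many strings (here, those coming from "corp") count less than rare ones, which is the main difference from plain trigram counts.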
Describe alternatives you've considered Elasticsearch.
Additional context I will more than likely work on implementing those indexes, but Jamie Liu told me it's better to get alignment with the team by filing an issue first.
What are your thoughts on a simple but native text index for Nebula? It can't replace Elasticsearch but should suffice for matching names and similar use cases.
The options under consideration are