[OEP 11] Lucene Improvements

smolinari commented 7 years ago

Ok. Here is a go at my first OEP suggestion.

Summary: Alongside the work being done to upgrade Lucene to version 6.2.0, this OEP collects the features customers have requested, with the aim of making full-text search in OrientDB one of the most powerful among current databases.

Goals: List the requested features for full-text indexing within OrientDB and come up with a final list to be accomplished for 3.0. Where issues with more information already exist, I've added links to them.

Non-Goals: This lists only fulltext search features. Geolocation features are not included.

Success metrics: Unknown currently.

Motivation: To be a truly multi-model database, OrientDB should also have a very good text search system built in.

Description:

Here is the list of requested improvements:

1. Offer Lucene metrics - When indexes become very populated, you may need to analyze their status (fragmentation, memory allocation, hit ratio, open files, ...)
2. Ranked searching - best results returned first
3. Multi-field / fielded searching (e.g. title, author, contents) (done)
4. Flexible faceting, highlighting (done), joins and result grouping
5. Sorting by any field
6. Fast, memory-efficient and typo-tolerant suggesters
7. Similarity queries or "More Like This" results, and/or synonyms - results similar to the requested search terms. (done)
8. Cross-class indexing (not sure if this is a Lucene feature, but I added it anyway). https://github.com/orientechnologies/orientdb/issues/5069 (done)
9. Support indexing embedded types Lists/Objects
10. Add support for all possible Lucene query types. https://github.com/orientechnologies/orientdb/issues/5189
11. Allow field names for manual index - If we can label the fields making up the key in a manual Lucene index, it will make the search capability more versatile.
12. [edit] Full-text search for non-schema defined fields.

Alternatives: A bridge/river/interface to Elasticsearch could theoretically be an alternative, where data is only added to or removed from the Elasticsearch indexes as needed. All the other Elasticsearch APIs could then be used directly by the end user for full-text searches.

Risks and assumptions: None currently.

Impact matrix

[ ] Storage engine
[ ] SQL
[ ] Protocols
[ ] Indexes
[ ] Console
[ ] Java API
[ ] Geospatial
[X] Lucene
[ ] Security
[ ] Hooks
[ ] EE

robfrank commented 7 years ago

Sorry for the long delay. I'll try to answer, or better, I'll try to open the discussion.

1. I agree. I need to work out how and where to expose these stats. This involves Studio too.
2. Maybe. In OrientDB, Lucene is used as AN index, it is not THE index. Usually you will have a complex SQL query with GROUP BY and ORDER BY clauses AND a Lucene part: which one wins? Ranking will work only in queries where the Lucene index is the only index involved and ORDER BY isn't used.
3. This is implemented: http://orientdb.com/docs/last/Full-Text-Index.html#working-with-multiple-fields
4. Faceting, highlighting and suggesters could be implemented only if we change the way query results are returned. At the moment only a list of ODocument (Vertex) is returned. We need to return a more complex (though not very complex) result set that can contain the documents AND a lot of metadata.
5. Same as 2.
6. Same as 4.
7. Same as 4: we need metadata on the result set.
8. I'm working on it. It's not as easy as it looks from 10,000 ft.
9. Hmm, AFAIK we already have it. Let me check.
10. Ok, approved.
11. Can you give me an example?
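
To make the reply about faceting, highlighting and suggesters more concrete, here is a minimal sketch of a result set that carries both the matched documents and per-query metadata. The names `SearchHit` and `SearchResultSet` are hypothetical and are not OrientDB classes; the shape is only an assumption about what "documents AND metadata" could look like.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical type for illustration only; not part of the OrientDB API.
final class SearchHit {
    final String recordId;                 // identifier of the matched document
    final float score;                     // relevance score reported by the full-text engine
    final Map<String, String> highlights;  // field name -> highlighted fragment

    SearchHit(String recordId, float score, Map<String, String> highlights) {
        this.recordId = recordId;
        this.score = score;
        // Defensive copy so later changes to the caller's map do not leak in.
        this.highlights = Collections.unmodifiableMap(new LinkedHashMap<>(highlights));
    }
}

// Hypothetical container: the hits plus facet counts gathered for the same query.
final class SearchResultSet {
    private final List<SearchHit> hits = new ArrayList<>();
    // facet name -> (facet value -> number of matching documents)
    private final Map<String, Map<String, Integer>> facets = new LinkedHashMap<>();

    void addHit(SearchHit hit) {
        hits.add(hit);
    }

    // Increment the counter for one value of one facet.
    void countFacet(String facet, String value) {
        facets.computeIfAbsent(facet, k -> new LinkedHashMap<>())
              .merge(value, 1, Integer::sum);
    }

    List<SearchHit> hits() {
        return Collections.unmodifiableList(hits);
    }

    // Returns an empty map when the facet was never counted.
    Map<String, Integer> facet(String facet) {
        return facets.getOrDefault(facet, Collections.emptyMap());
    }
}
```

A caller would still iterate over `hits()` as it does over a plain document list today, and read highlights or facet counts only when it needs them.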

smolinari commented 7 years ago

Hi Roberto,

Thanks for responding.

I figured I'd put the most important suggestions together for the Lucene index here. Unfortunately, I am not the initiator of some of the suggestions, so I can't give you the answers to some of your concerns or needs.

Regarding 2: the suggestion is purely about the full-text features. GROUP BY and ORDER BY wouldn't be part of such a search, as the ranking would determine the order. Grouping might be a consideration, if it is possible, but ranking would certainly only be against the Lucene-indexed fields.

For 8, I think it would be enough if we could get the results grouped by class, maybe limited to 5 or 10 results per class at first. Would that simplify the ability to search across classes? Consider classes as domain objects: a search across these domains would bring in results from any domain object, but reduced to only a few of the best ones. If a user needs more in-depth searching in one domain, they'd ask for the full list of results from that one domain (class).

For 11, here is the issue I got the suggestion from. I believe it explains more.
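
The grouping idea for point 8 can be sketched as follows. This is only an illustration under stated assumptions: `ClassHit` and `CrossClassGrouping` are hypothetical names, and the hits are assumed to arrive already scored by the full-text engine.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical scored hit that remembers which class it came from.
final class ClassHit {
    final String className;
    final String recordId;
    final float score;

    ClassHit(String className, String recordId, float score) {
        this.className = className;
        this.recordId = recordId;
        this.score = score;
    }
}

final class CrossClassGrouping {
    // Keeps at most `limit` best-scoring hits per class.
    // Classes appear in the order of their best hit.
    static Map<String, List<ClassHit>> topPerClass(List<ClassHit> hits, int limit) {
        List<ClassHit> sorted = new ArrayList<>(hits);
        // Highest score first.
        sorted.sort(Comparator.comparingDouble((ClassHit h) -> h.score).reversed());

        Map<String, List<ClassHit>> grouped = new LinkedHashMap<>();
        for (ClassHit hit : sorted) {
            List<ClassHit> bucket =
                grouped.computeIfAbsent(hit.className, k -> new ArrayList<>());
            if (bucket.size() < limit) {
                bucket.add(hit); // lower-ranked hits beyond the limit are dropped
            }
        }
        return grouped;
    }
}
```

A follow-up query restricted to a single class would then return the full list for that domain.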

https://github.com/orientechnologies/orientdb/issues/5185

Let me add that the full-text searching in other NoSQL and even SQL solutions leaves a lot to be desired. That is why solutions like Elasticsearch win in popularity. This nugget of gold in ODB, if polished to a shine, would blow the socks off other NoSQL solutions, especially MongoDB, whose full-text capabilities are very weak.

One other thing that needs looking at is

12 - Full-text search for non-schema defined fields.

I am personally still up in the air about this myself, as I am not sure how to get the definitions done, to properly index without the schema. But for sure, this is necessary, in order for ODB to call itself a NoSQL database. The fact ODB can't handle the indexing of fields without defined schema means to me, ODB isn't NoSQL. It is only SQL.

Scott

robfrank commented 7 years ago

Hi @smolinari, we are moving forward, take a look at https://github.com/orientechnologies/orientdb/issues/7155

We are moving to functions to allow more flexibility in using the search feature, and this way full-text and spatial search will work in the same way.

BTW, I really don't understand 12, full-text search for non-schema defined fields. I've been in the search field for a while now, 10 years more or less. For sure we can index ALL properties of all documents stored in OrientDB. But what can you find this way? Which analyzers should be used for indexing? The StandardAnalyzer is suitable only for western languages. Try it on Chinese or Japanese and you will get a completely useless index.

smolinari commented 7 years ago

Looking good! Can't wait to see how 3.0 runs.

By 12 I mean being able to define an index on any property, without that property being specifically created in ODB's schema system. This is actually not a problem with search, but with the fact that ODB requires a schema definition to create an index in general.

What I am looking for is what Mongo offers in the way of creating indexes without schema definitions in the database. In other words, if I say as a developer that there is a property in a class, then that is all ODB should need to allow indexing. It is up to me, as a developer, to make sure that property is really there.
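
A minimal sketch of what I mean, assuming documents are plain maps and the developer simply names the property to index. `SchemalessPropertyIndex` is a hypothetical class, not an ODB or Lucene API, and its tokenizer is deliberately naive (it only splits on ASCII word boundaries, so it says nothing about the analyzer question raised above).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical index over one named property; no schema is consulted.
final class SchemalessPropertyIndex {
    private final String property;
    // token -> ids of the records whose property value contains that token
    private final Map<String, List<String>> postings = new HashMap<>();

    SchemalessPropertyIndex(String property) {
        this.property = property;
    }

    // Indexes the document if it has the property as text; otherwise skips it.
    void add(String recordId, Map<String, Object> document) {
        Object value = document.get(property);
        if (!(value instanceof String)) {
            return; // property absent or not text: nothing to index
        }
        // Naive tokenizer: lowercases and splits on non-word (ASCII) characters.
        for (String token : ((String) value).toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<String> ids = postings.computeIfAbsent(token, k -> new ArrayList<>());
            if (!ids.contains(recordId)) {
                ids.add(recordId);
            }
        }
    }

    // Returns the ids of records containing the term, or an empty list.
    List<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
    }
}
```

Documents that lack the property are simply left out of the index, which is the "it is up to me" part of the contract.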

If this concept doesn't change, ODB's claim to be a NoSQL database is only a half-truth.

Scott

orientechnologies / orientdb-labs