sirixdb / sirix

SirixDB is an embeddable, bitemporal, append-only database system and event store, storing immutable lightweight snapshots. It keeps the full history of each resource. Every commit stores a space-efficient snapshot through structural sharing. It is log-structured and never overwrites data. SirixDB uses a novel page-level versioning approach.
https://sirix.io
BSD 3-Clause "New" or "Revised" License
1.13k stars 252 forks

Fulltext search #563

Open JohannesLichtenberger opened 1 year ago

JohannesLichtenberger commented 1 year ago

We need a way to do full-text search on text nodes. It may be possible to integrate Lucene for this.

Rathan-Naik commented 1 year ago

I can pitch in here.

JohannesLichtenberger commented 1 year ago

We have to check whether we can implement our own store (I think Lucene calls it a Directory) and the fields. Our main data structure is a keyed trie indexing 64-bit nodeKeys <=> nodes, and it would be great if we could store the full-text index in our persistent structure in the same way. I haven't looked into Lucene yet, though.

adamretter commented 1 year ago

We make use of Lucene in eXist-db for the Full Text index. There are definitely advantages and disadvantages to using Lucene.

On the one hand Lucene is very mature and flexible whilst offering decent performance. If you want to implement something like the W3C XQuery Full Text extensions, it will have almost everything you need baked in. Also, you can allow users to choose or code their own Analyzers for pretty much any language or purpose which is neat.

On the other hand, if you need transactional consistency, as far as I am aware there is no good way to involve Lucene in the transactions against your own indexes. I enquired some time ago, so perhaps things have changed more recently, but previously there was no way to control Lucene transactions directly, so you could not do a 2PC approach.

JohannesLichtenberger commented 1 year ago

Hi Adam, isn't the single writer supposed to implement the two-phase commit interface, https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/index/TwoPhaseCommit.html?
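To make the 2PC idea concrete, here is a minimal sketch of a coordinator that drives several participants through prepare and commit. It does not use Lucene itself: `TwoPhaseParticipant` is a hypothetical local interface that only mirrors the method names of Lucene's `TwoPhaseCommit` (`prepareCommit`, `commit`, `rollback`), and `TwoPhaseCoordinator` and `RecordingParticipant` are made-up names for illustration. How a SirixDB write transaction would actually plug in is an open assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

// Hypothetical interface: same method names as Lucene's TwoPhaseCommit,
// declared locally so the sketch compiles without Lucene on the classpath.
interface TwoPhaseParticipant {
    void prepareCommit() throws IOException;
    void commit() throws IOException;
    void rollback() throws IOException;
}

// Hypothetical coordinator: prepares every participant first and only
// commits if all of them prepared successfully.
final class TwoPhaseCoordinator {
    static boolean commitAll(List<TwoPhaseParticipant> participants) {
        try {
            // Phase 1: every participant makes its changes durable but not yet visible.
            for (TwoPhaseParticipant p : participants) {
                p.prepareCommit();
            }
        } catch (IOException e) {
            // A prepare failed: roll back every participant, including the failed one.
            for (TwoPhaseParticipant p : participants) {
                try {
                    p.rollback();
                } catch (IOException ignored) {
                    // Best effort in this sketch; a real system would log and recover.
                }
            }
            return false;
        }
        // Phase 2: make the prepared changes visible. A failure here would need
        // a recovery protocol, which this sketch does not attempt.
        for (TwoPhaseParticipant p : participants) {
            try {
                p.commit();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return true;
    }
}

// Hypothetical participant that records each call, used to observe the ordering.
final class RecordingParticipant implements TwoPhaseParticipant {
    private final String name;
    private final boolean failOnPrepare;
    private final List<String> log;

    RecordingParticipant(String name, boolean failOnPrepare, List<String> log) {
        this.name = name;
        this.failOnPrepare = failOnPrepare;
        this.log = log;
    }

    @Override public void prepareCommit() throws IOException {
        log.add(name + ":prepare");
        if (failOnPrepare) {
            throw new IOException(name + " could not prepare");
        }
    }

    @Override public void commit() { log.add(name + ":commit"); }

    @Override public void rollback() { log.add(name + ":rollback"); }
}
```

With two participants (say, one standing in for the SirixDB transaction and one for the Lucene writer), nothing becomes visible unless both have prepared. The hard part that the sketch leaves out is a failure during the second phase.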

I had a quick look, and I think we'd need to implement a custom Directory... but I'm not sure if we can somehow store the Documents in another subtree (in a trie) as we do with the other indexes. That way the index would be versioned automatically, which is what we need after all. AFAICS, the documents are written in DocumentsWriter, which is sadly not an interface, and its instances are created directly in IndexWriter. So I'm not sure if it's even possible to change the index structure in which Lucene stores the documents, beyond swapping the actual Directory it reads from and writes to!?
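As a rough illustration of what a custom Directory could delegate to, the sketch below keeps named byte blobs (the "files" an index writes) in an append-only list of immutable snapshots, one per revision. It is not a Lucene `Directory` implementation and not SirixDB's trie: `VersionedFileStore` and all of its methods are hypothetical, and an in-memory map stands in for the persistent structure. The only assumption about Lucene is that a Directory boils down to operations such as listing, reading, writing and deleting named files.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical append-only store: each commit adds an immutable snapshot of
// all files, and earlier snapshots are never modified.
final class VersionedFileStore {
    private final List<Map<String, byte[]>> revisions = new ArrayList<>();
    private final Map<String, byte[]> working = new HashMap<>();

    VersionedFileStore() {
        // Revision 0 is the empty store.
        revisions.add(Collections.emptyMap());
    }

    // Stage a file in the uncommitted working state.
    void write(String name, byte[] content) {
        working.put(name, content.clone());
    }

    // Stage a deletion; committed revisions still contain the file.
    void delete(String name) {
        working.remove(name);
    }

    // Freeze the working state as a new revision and return its number.
    int commit() {
        // Shallow copy: unchanged files share the same byte[] across revisions.
        revisions.add(Collections.unmodifiableMap(new HashMap<>(working)));
        return revisions.size() - 1;
    }

    // Read a file as of a given revision; returns null if it did not exist then.
    byte[] read(int revision, String name) {
        byte[] content = revisions.get(revision).get(name);
        return content == null ? null : content.clone();
    }

    // List the file names visible in a given revision, sorted.
    Set<String> listAll(int revision) {
        return new TreeSet<>(revisions.get(revision).keySet());
    }
}
```

A Directory built on top of something like this would get time travel for free: opening the index at revision N only needs `listAll(N)` and `read(N, name)`. Whether Lucene's segment files map well onto a trie of pages is exactly the open question raised above.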