sirixdb / sirix

SirixDB is an embeddable, bitemporal, append-only database system and event store that stores immutable, lightweight snapshots. It keeps the full history of each resource: every commit stores a space-efficient snapshot through structural sharing. It is log-structured, never overwrites data, and uses a novel page-level versioning approach.
https://sirix.io
BSD 3-Clause "New" or "Revised" License
1.11k stars 250 forks

Optimize SirixDeweyId.toBytes-method #512

Open JohannesLichtenberger opened 2 years ago

JohannesLichtenberger commented 2 years ago

We can probably use the new Vector API to use explicit SIMD instructions.

IlayMenahem commented 2 years ago

Hi, I'm new to open-source contributing; I've read about SIMD and can help with this task. Also, can you point me to where exactly the SirixDeweyId.toBytes method is?

JohannesLichtenberger commented 2 years ago

The class is called SirixDeweyID and the method is toBytes(int[]). We've optimized the method a bit, and optimized other things regarding the storage of these hierarchical IDs. Runtime on my notebook is now around 5 minutes for importing a 3.8 GB file with generation of these IDs enabled (without it, it needs around 3:42 min; hashes are enabled in both cases). I'm currently profiling the method via the following test (you have to remove the @Ignore annotation and add the cityofchicago.json file, which I've uploaded here: https://www.transfernow.net/dl/20220812X7v5ZTGZ, to the resource JSON folder):

JsonShredderTest::testChicago

The concept of DeweyIDs is described in the following documents (they are stored as Huffman codes, since numbers often repeat in the hierarchical labels -- we're using them for our JSON nodes, too): http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/users/mathis/documents/HHM_SBBD.pdf and http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/publications/2010/HM10.JIDM.pdf
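As a rough illustration of what toBytes(int[]) has to do, here is a simplified variable-length encoder for the divisions of a DeweyID. The helper below is hypothetical and uses a plain 7-bits-per-byte continuation scheme; SirixDB's actual code uses denser Huffman-style codes tuned to small, frequently repeating divisions:

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: packs each division of a DeweyID like 1.3.2.200 into a
// variable-length byte sequence (7 data bits per byte, high bit = "more").
// The real toBytes(int[]) in SirixDeweyID is Huffman-based and denser.
public class DeweyIdSketch {
    static byte[] toBytes(int[] divisions) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int division : divisions) {
            // number of significant bits (at least 1, so division 0 still emits a byte)
            int bits = Math.max(1, 32 - Integer.numberOfLeadingZeros(division));
            int groups = (bits + 6) / 7; // 7-bit groups, most significant first
            for (int g = groups - 1; g >= 0; g--) {
                int chunk = (division >>> (g * 7)) & 0x7F;
                out.write(g == 0 ? chunk : chunk | 0x80); // high bit marks continuation
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(new int[] {1, 3, 2, 200});
        // 1, 3, 2 each fit in one byte; 200 needs two -> 5 bytes total
        System.out.println(bytes.length);
    }
}
```

A per-division loop like this is also the kind of code the Vector API suggestion above targets: the bit-length computation for many divisions at once is a natural SIMD candidate.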

JohannesLichtenberger commented 2 years ago

It would probably be a good start to profile the mentioned testChicago method and switch DeweyID generation on/off.

IlayMenahem commented 2 years ago

I've tried to run the test with and without @Ignore; both times I got the same message. I also tried to find whether there is any special guidance on how to run the tests, and there wasn't.

The supplied phased action failed with an exception.
A problem occurred configuring root project 'sirix-parent'.
Failed to notify project evaluation listener.
A problem occurred configuring project ':sirix-core'.
A problem occurred evaluating project ':sirix-core'.
Could not find method api() for arguments [io.sirix:brackit:0.2] on object of type org.gradle.api.internal.artifacts.dsl.dependencies.DefaultDependencyHandler.
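For reference, this particular Gradle failure is usually unrelated to the test itself: the `api(...)` dependency configuration only exists once the `java-library` plugin is applied, so the error typically means the build was run with an older standalone Gradle instead of the project's wrapper (`./gradlew`), or a build script declares `api` dependencies without that plugin. A minimal sketch of the required setup:

```groovy
// The api(...) dependency configuration is contributed by the
// java-library plugin; with only the plain 'java' plugin it is undefined
// and Gradle reports "Could not find method api()".
plugins {
    id 'java-library'
}

dependencies {
    api 'io.sirix:brackit:0.2'
}
```

Running the build via the checked-in Gradle wrapper, which pins a compatible Gradle version, is the safest first step.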

JohannesLichtenberger commented 2 years ago

Can you try executing using IntelliJ?

yeohbraddy commented 2 years ago

Hi,

If @IlayMenahem is no longer working on this, I'd like to contribute! It is my first time contributing to open source so apologies if this is against etiquette.

Thank you!

JohannesLichtenberger commented 2 years ago

Sure, thanks :-) I'm also thinking about not storing DeweyIDs for object field values to reduce storage and CPU processing costs. I'm currently still on holiday with my small tablet so I can't code right now.

JohannesLichtenberger commented 2 years ago

@yeohbraddy I guess it's perfectly fair to assume so; in my experience, after a week or more without any comments/PRs etc., the user hasn't had time (or something like that) to work on the issue.

JohannesLichtenberger commented 1 year ago

@yeohbraddy any work in progress?

JohannesLichtenberger commented 1 year ago

@yeohbraddy I guess you don't have time to work on the issue!? Let me know; then I'd unassign the issue for now or have another look myself. I think it should be possible to use explicit SIMD.

yeohbraddy commented 1 year ago

Hi @JohannesLichtenberger, sorry I missed the previous ping! Yes, unfortunately I don't have time to work on this issue; I started a new job 😌

ConnorCable commented 1 year ago

Hi! I'm new to OSS and interested in taking this on. I think I have a grasp of the issue, but I'd like to jump on a call with someone to learn more and get a sense of direction.

Thank you!

JohannesLichtenberger commented 1 year ago

Sounds great. The thing is that these IDs, even Huffman-encoded, almost double storage costs in my tests, and they also carry a big performance penalty due to the toBytes() method, which adds up in CPU time. The class is based on this class: https://github.com/sebbae/brackitdb/blob/master/brackitdb-server/src/main/java/org/brackit/server/node/XTCdeweyID.java

You can check JsonShredderTest::testChicago and enable the storage of DeweyIDs. You'll need this file to execute the test: https://www.file-upload.net/download-15038251/cityofchicago.zip.html

JohannesLichtenberger commented 1 year ago

You can also join our Discord channel if you need further explanations :-)

adamretter commented 1 year ago

@JohannesLichtenberger I have been lurking for a while. I contribute to eXist-db, where we use Dynamic Level Numbering for our hierarchical IDs. Perhaps they would be more efficient for your needs?

ConnorCable commented 1 year ago

@adamretter Hey Adam, do you have any papers or resources regarding Dynamic Level Numbering I can take a look at?

JohannesLichtenberger commented 1 year ago

Hi Adam, awesome that you've had an eye on SirixDB 👍 Can you elaborate on how they are implemented in eXist-db? I think in SirixDB they are especially useful for sorting XQuery results in document order. However, I've switched to disabling them by default for JSON storage because of the space consumption and the drop in insertion latency.

BTW: Do you have an idea how to store bigger files and run automatic integration tests on these?

JohannesLichtenberger commented 1 year ago

Even Git LFS only allows storing up to 2 or 3 GB of data at most, IIRC (but maybe that's another issue :-))

adamretter commented 1 year ago

@ConnorCable The paper is titled Supporting Efficient Streaming and Insertion of XML Data in RDBMS and is authored by Timo Böhme and Erhard Rahm.

There is an implementation of this in eXist-db here - https://github.com/eXist-db/exist/blob/develop/exist-core/src/main/java/org/exist/numbering/DLN.java#L42

The implementation in eXist-db is by no means perfect. eXist-db also introduces sub-levels to support XQuery Update. If I recall correctly, it had some bugs, and there were further optimisations that could be made... which we improved upon in our DLN implementation in FusionDB.

I actually think the ORDPATH approach from Microsoft has some advantages over DLN; unfortunately, though, Microsoft holds a patent on ORDPATH, so I deemed it unimplementable for my uses.
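The ordering and containment semantics that make DLN (and DeweyIDs) useful can be sketched as follows. This is a simplified, string-based illustration of the idea, not eXist-db's actual bit-packed encoding (which also supports sub-levels for updates):

```java
import java.util.Arrays;

// Simplified illustration of Dynamic Level Numbering ids such as "1.3.2":
// each dot-separated component is one tree level. NOT eXist-db's real
// bit-packed format; it only demonstrates ordering/ancestry semantics.
final class Dln implements Comparable<Dln> {
    final int[] levels;

    Dln(String id) {
        String[] parts = id.split("\\.");
        levels = new int[parts.length];
        for (int i = 0; i < parts.length; i++) levels[i] = Integer.parseInt(parts[i]);
    }

    /** Document-order comparison: level by level, shorter prefix (ancestor) first. */
    @Override public int compareTo(Dln other) {
        int n = Math.min(levels.length, other.levels.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(levels[i], other.levels[i]);
            if (c != 0) return c;
        }
        return Integer.compare(levels.length, other.levels.length);
    }

    /** An id is an ancestor of another iff it is a strict prefix of it. */
    boolean isAncestorOf(Dln other) {
        if (levels.length >= other.levels.length) return false;
        return Arrays.equals(levels, 0, levels.length, other.levels, 0, levels.length);
    }
}

public class DlnDemo {
    public static void main(String[] args) {
        Dln root = new Dln("1"), a = new Dln("1.3"), b = new Dln("1.3.2"), c = new Dln("1.10");
        System.out.println(root.isAncestorOf(b)); // prefix of 1.3.2
        System.out.println(a.compareTo(c) < 0);   // 3 < 10 numerically, unlike string order
        System.out.println(a.isAncestorOf(c));    // sibling branch, not an ancestor
    }
}
```

The payoff is that ancestry and document order fall out of the id alone, without touching the stored nodes; the cost, as discussed above, is the variable-length labels themselves.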

adamretter commented 1 year ago

I think in SirixDB they are especially useful for sorting regarding document order of XQuery results

@JohannesLichtenberger Sure. We also use them for fast sibling, parent, and ancestor lookups/proxies.

BTW: Do you have an idea how to store bigger files and run automatic integration tests on these?

I am not quite sure what you are asking about here. I haven't looked in detail at how SirixDB stores its data. eXist-db stores its XML as the (SAX-like) events needed to build a DOM, in a B+Tree. FusionDB takes a similar approach but uses an LSM-tree instead.

JohannesLichtenberger commented 1 year ago

The serialization, at least, is straightforward for DLN, without the Huffman encoding. I'm also skipping values that share the same prefix in the page fragment during serialization :-)

True, but I think hierarchical node labels suffer from worse axis-traversal performance in comparison to pre/post or pre/size/dist encodings (I don't remember the exact encoding in BaseX), or, in our case, a persistent DOM. However, maybe it also depends on the index structure. I think they are ideally stored in a trie instead of in a B+ tree as in BrackitDB.

It would be amazing to see a version storing only the leaf nodes plus the path summary (I think in BrackitDB/XTC they called it elementless storage).

In SirixDB we needed unique IDs that are never reassigned (thus based on a simple AtomicLong-based sequence generator), and we store a nodeKey/parent/firstChild/lastChild/rightSibling/leftSibling encoding (basically a doubly linked list of children) in order to get rid of variable-length node labels such as the DeweyIDs we mention here. But I think in some cases they are also really useful, as mentioned in the papers by Theo Haerder, even if we use a single writer, for instance.
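The fixed-size encoding described in that paragraph can be sketched roughly as follows. Field names mirror the ones listed above, but the record and demo are a hypothetical illustration, not SirixDB's actual node layout:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the structural node encoding described above:
// every pointer is a stable 8-byte nodeKey (0 = "none"), so records have a
// fixed size and no variable-length DeweyID label is needed for navigation.
record StructNode(long nodeKey, long parentKey,
                  long firstChildKey, long lastChildKey,
                  long leftSiblingKey, long rightSiblingKey) {}

public class NodeEncodingDemo {
    public static void main(String[] args) {
        // keys come from a monotonically increasing sequence and are never reused
        AtomicLong seq = new AtomicLong();
        long rootKey = seq.incrementAndGet();
        long childKey = seq.incrementAndGet();

        StructNode root = new StructNode(rootKey, 0, childKey, childKey, 0, 0);
        StructNode child = new StructNode(childKey, rootKey, 0, 0, 0, 0);

        // parent/child links form the doubly linked sibling structure
        System.out.println(root.firstChildKey() == child.nodeKey()
                && child.parentKey() == root.nodeKey());
    }
}
```

The trade-off versus DeweyIDs is exactly the one discussed in this thread: constant-size records and stable keys, but ancestry and document order require following pointers rather than comparing labels.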

Regarding DeweyIDs in SirixDB: we can quickly select the diffs in subtrees of specific nodes between two revisions, and also in document order/preorder, which is also nice when storing JSON. So they might be useful in some cases; that's why I implemented the optional storage in the first place, but performance and storage costs are currently high.

BTW: Do you already have a query engine for FusionDB? Otherwise, I'd like to mention Brackit, which implements JSONiq with set-oriented operators and implicit join recognition to support hash joins, sort-merge joins, and so on. It's meant to be a query engine for different backends, where you can add AST rewrite rules for index accesses, add additional expressions..., and it should implement all the optimizations that are not part of the individual storage engine :-)

Sorry for the very, very long reply, but I'm missing people with whom I can share some thoughts and ideas.

JohannesLichtenberger commented 1 year ago

By bigger files, I mean importing JSON or XML files of up to at least a few GB and running other integration tests on them. Maybe most database system engineers add data via the APIs instead of importing GBs of data from other storage formats for their tests...

Currently, I'm commenting/uncommenting tests locally, but that's of course not the right way, because I can't store these files in Git.
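A common alternative to commenting tests out is to have the test skip itself when the large fixture is absent (with JUnit 5 this is `Assumptions.assumeTrue(Files.exists(...))`, so the test is reported as skipped rather than failed). A dependency-free sketch of the guard; the fixture path is just an assumed example location:

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: guard an expensive integration test on the presence of a large
// fixture file that is downloaded out-of-band (CI step, release asset, ...)
// instead of being committed to Git.
public class LargeFileFixture {
    // assumed location; the real test resources folder may differ
    static final Path CITY_OF_CHICAGO =
            Path.of("src/test/resources/json/cityofchicago.json");

    public static void main(String[] args) {
        if (Files.notExists(CITY_OF_CHICAGO)) {
            // in JUnit this would be Assumptions.assumeTrue(...), marking the test skipped
            System.out.println("SKIPPED: fixture not present");
            return;
        }
        System.out.println("RUN");
    }
}
```

Combined with a CI step that fetches and checksum-verifies the file, this keeps the repository small while the big-file tests still run automatically where the data is available.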