Open JohannesLichtenberger opened 1 year ago
Still open ?
Yes
@anmol797 we should probably chat about this
Yes
can i take it up ? i am new to open source contribution
@anmol797 we should probably chat about this
Sure
Are you familiar with basic data structures, especially functional/persistent structures? I'd like to discuss this first with someone already familiar to these topics :-)
Are you familiar with basic data structures, especially functional/persistent structures? I'd like to discuss this first with someone already familiar to these topics :-)
yes i am aware of all these i am a fresher working professional in IT sector
Ok, so basically the main datastructure in Sirix is a "keyed", persistent trie to fetch pages (full pages) or page fragments (e.g. only changed records plus records which fall out of a sliding window). The data is stored in the leaf pages of the tries. Currently, we store JSON nodes or XML nodes in a trie. Secondary indexes based on Red/Black balanced binary trees are currently stored in other tries.
Now, as red/black trees are not cache-line friendly (and hopefully Valhalla bears some fruits better sooner than later) we could alternatively store the secondary indexes as Adaptive Radix trees (or the newer variant Height Optimized trees). That said to incorporate also the sliding snapshot algorithm used to version the current leaf pages is a lot of work...
I'm currently not sure if the rotations due to balancing a binary tree lead to a lot of copied tree nodes (which are currently stored in the trie leaf pages). Instead of implementing a persistent ART with also versioning the leaf pages it would of course be much simpler to simply read the stored red black tree nodes into an ART, but not sure if that really would make sense.
On the other hand it's currently "nice to have", but a lot of work (maybe half a year or maybe even much more as it's done in spare time).
Another pressing issue is, why a full scan of a bigger resource (3,8Gb JSON file stored in Sirix) in parallel traversed by N read-only trxs is much slower than with only one trx (it's not at all obvious for me currently when profiling what's the issue).
Ok, so basically the main datastructure in Sirix is a "keyed", persistent trie to fetch pages (full pages) or page fragments (e.g. only changed records plus records which fall out of a sliding window). The data is stored in the leaf pages of the tries. Currently, we store JSON nodes or XML nodes in a trie. Secondary indexes based on Red/Black balanced binary trees are currently stored in other tries.
Now, as red/black trees are not cache-line friendly (and hopefully Valhalla bears some fruits better sooner than later) we could alternatively store the secondary indexes as Adaptive Radix trees (or the newer variant Height Optimized trees). That said to incorporate also the sliding snapshot algorithm used to version the current leaf pages is a lot of work...
I'm currently not sure if the rotations due to balancing a binary tree lead to a lot of copied tree nodes (which are currently stored in the trie leaf pages). Instead of implementing a persistent ART with also versioning the leaf pages it would of course be much simpler to simply read the stored red black tree nodes into an ART, but not sure if that really would make sense.
On the other hand it's currently "nice to have", but a lot of work (maybe half a year or maybe even much more as it's done in spare time).
Another pressing issue is, why a full scan of a bigger resource (3,8Gb JSON file stored in Sirix) in parallel traversed by N read-only trxs is much slower than with only one trx (it's not at all obvious for me currently when profiling what's the issue).
okay okay , understood , so basically the optimization of Read need to be done ""
" Instead of implementing a persistent ART with also versioning the leaf pages it would of course be much simpler to simply read the stored red black tree nodes into an ART, but not sure if that really would make sense." can you please explain a bit more ? and this is the approach you want me to use ? "Now, as red/black trees are not cache-line friendly (and hopefully Valhalla bears some fruits better sooner than later) we could alternatively store the secondary indexes as Adaptive Radix trees"
Point is it's currently more of a "nice to have" thing with a huge possibility that work is not going to be finished and it's more like "this could be better" ;-)
However, I think what would be valuable in any case would be to separate the PageReadOnlyTrx
and the PageTrx
from the current trie implementation. IMHO these should be the StorageEngineReader
,StorageEngineWriter
with separate KeyedTrieReader
,KeyedTrieWriter
classes. Maybe you'd like to start with this "subtask" to get familiar with the code base?
TreeModifierImpl
and the interface would probably be the KeyedTrieWriter
.
Oh and BTW: we have to fix this first as currently the CI build always fails due to an old docker image format used by Keycloak 7.0.1, which is of course super old and should be updated: https://github.com/sirixdb/sirix/issues/711
Point is it's currently more of a "nice to have" thing with a huge possibility that work is not going to be finished and it's more like "this could be better" ;-)
However, I think what would be valuable in any case would be to separate the
PageReadOnlyTrx
and thePageTrx
from the current trie implementation. IMHO these should be theStorageEngineReader
,StorageEngineWriter
with separateKeyedTrieReader
,KeyedTrieWriter
classes. Maybe you'd like to start with this "subtask" to get familiar with the code base?
yeah sure , that will be more effective and helpful for me
please let me know what to start with
Oh and BTW: we have to fix this first as currently the CI build always fails due to an old docker image format used by Keycloak 7.0.1, which is of course super old and should be updated: https://github.com/sirixdb/sirix/issues/711
okay i can help with that too
You could start with the Keycloak issue and then with the proposed refactoring?
You could start with the Keycloak issue and then with the proposed refactoring?
ok , please tell me how to start with that
You have to update the docker files and check that it works again, as startup scripts to import the realm are not supported anymore
You have to update the docker files and check that it works again, as startup scripts to import the realm are not supported anymore
how to test that change ? any reference ?
Currently, we're reading the red-black tree nodes from the disk on the first load and putting the nodes into a global buffer manager/cache. Still, we could, for instance, also use the adaptive radix tree or an even better data structure for read-only...