usethesource / capsule

The Capsule Hash Trie Collections Library
BSD 2-Clause "Simplified" License
404 stars 28 forks source link

Ordered collections #2

Open pniederw opened 8 years ago

pniederw commented 8 years ago

Will capsule also provide sets/maps that maintain insertion order, or is this out of scope? And will it provide lists? Sorry for asking these questions here, didn't know where else to ask.

msteindorfer commented 8 years ago

Hi Peter, sets/maps with insertion order are well in scope for capsule. Last summer I started to explore designs of such data structures for another project. At the moment, implementing insertion order is not my top priority, thus it will take some time until I'll add them.

Eventually I'm also going to add lists (likely something similar to RRB Trees).

The next data structures that I'll add to capsule will be sets/maps that can mix primitive elements with reference elements in a memory efficient way. The reasoning behind is to provide the best of specialized primitive collections and of generic collections in a single implementation (planned timeline: end of march).

I'll add milestones for what's missing / planned. Thanks for voicing your interest.

pniederw commented 8 years ago

Sounds great! Looking forward to all of this.

ilya-g commented 7 years ago

https://github.com/usethesource/capsule/blob/master/capsule-experimental/src/main/java/io/usethesource/capsule/experimental/ordered/OrderedTrieMap.java#L150 Interesting idea with sequence ids. What happens when nextSequenceId overflows?

msteindorfer commented 7 years ago

Hi @ilya-g, thanks for asking. The OrderedTrieMap implementation doesn't implement overflow handling yet, however I'll sketch the idea: in the constructor you can check if size reaches MAX_INT, if this happens you need to traverse the tree in insertion order and reassign subsequent IDs to each item (to essentially get rid of holes in the current sequence IDs).

Rewriting the tree for reassigning IDs is a costly operation, however shouldn't occur in practice often. The scenario where you'd encounter it is if you have really large collections and perform plenty of alternating inserts and deletes (because only deletes create holes in the sequence IDs). Other operations such as put do not update the sequenceId of a tuple.

If you like to use the OrderedTrieMap, give it a try. Apart from the missing overflow handling (that I'm willing to add as well) it's a fully functioning ordered map. I implemented a similar version of this data structure for a closed source project some time ago.

Note, the OrderedTrieMap is currently located in the capsule-experimental module, but with some integration work I'd be able to move it to the main capsule module in the near future.

ilya-g commented 7 years ago

Another option is to use Long for sequenceId – it hardly overflows in a hundred years. 😀

The thing that concerns me though is that one has to pay O(n∙log(n)) cost for sorting entries before iteration, and in some scenarios it can happen every time iteration is started (i.e. you iterate collection once and then modify it somehow).

Have you considered keeping an immutable sorted map of sequenceId=>entry and therefore spreading that cost among modification operations?

msteindorfer commented 7 years ago

Sorry for the late response. To answer your questions: I wouldn't like to use Long because it introduces to much overhead in my opinion.

Regarding sorting before iteration: this is all about implementation choices. To avoid sorting at all, one could use a reversed map as you pointed out correctly as well. The other solution would be to sort lazily upon the first time of iteration (as done now), and from then on update the reverse index incrementally.