npgall / concurrent-trees

Concurrent Radix and Suffix Trees for Java
Apache License 2.0
510 stars 82 forks source link

Optimize ByteArrayNodeFactory and CharArrayNodeFactory for "dense" trees #24

Closed knutwannheden closed 7 years ago

knutwannheden commented 7 years ago

This request may also apply to other NodeFactory implementations, but so far I have only been using the ByteArrayNodeFactory and CharArrayNodeFactory implementations.

In my use case I build a tree for 1.5M+ mappings resulting in a RadixTree with 1.9M+ nodes. Of these nodes more than 33% have an incoming edge containing only a single character. Wrapping that single byte (or char) in an array incurs a rather big overhead: 4 bytes for the reference + 16 bytes for the array (12 object overhead, 1 byte payload, 3 bytes of padding). So a total of 20 bytes vs. a single byte (or two bytes for CharArrayNodeFactory). Depending on the padding required for the concrete node implementation I suppose one could save between 16 and 20 bytes of memory, when dealing with nodes with a single character in the incoming edge. For my particular tree that makes quite a difference.

Since I can supply my own NodeFactory implementation I can of course extend the mentioned implementations accordingly. I was just wondering if this improvement may be of general interest.

npgall commented 7 years ago

Hi Knut,

This sounds useful. If you could create a PR which adds the additional node type for each of the factories then it sounds like a useful inclusion!

knutwannheden commented 7 years ago

Hi Niall,

DefaultByteArrayNodeFactory currently already returns one of six different implementations. This number would now go up to twelve. (Any more and I think it might make sense to generate the classes.) I am thinking of using the same naming scheme, but replace the prefix ByteArray with SingleByte. The alternative would be to have the prefix Byte only. Sounds good to you?

The getIncomingEdge() method has to return a CharSequence. For this I will create a very simple implementation, but I am wondering if it would make sense to add a cache like used by Java's Character#valueOf(char) method.

Would you want to also implement this for the DefaultCharArrayNodeFactory?

Detail: I noticed the DefaultByteArrayNodeFactory and its implementation classes refer to UTF-8 in the Javadoc. But AFAIK any code point beyond 0x9F needs two bytes when encoded using UTF-8. So I think referring to the ISO-8859-1 encoding would be more appropriate.

npgall commented 7 years ago

This has become quite a heavy proposal :/

I was originally thinking we might only have one new node implementation per type of NodeFactory, but I see now we'd need 12 new classes. It's a lot of code for a minor improvement.

I'm thinking that for a version 3.x of this project I might add support to store the trees on disk, or in off-heap memory. So this would reduce the need to optimize for every single byte of memory usage on the heap. Is the current memory usage a big problem, or do you think you could wait for off-heap/disk support in future?

knutwannheden commented 7 years ago

I almost expected you to consider this change a bit heavyweight for what it achieves. That is perfectly understandable and not any problem for me, since I can easily supply my own NodeFactory as it is.

What you say with regards to off-heap memory sounds very interesting! Especially since the data I am currently indexing is already off-heap. I will be interested to see how you plan to organize the data in memory, because I have personally also already considered an off-heap implementation for RadixTree. Also, do you already have a target release date for 3.0?

If Java 9 comes with Value Objects (see e.g. https://www.sitepoint.com/what-java-might-one-day-look-like/) it will be interesting to see how many applications still actually benefit from off-heap storage.

npgall commented 7 years ago

So as discussed, I will close this issue for now. Let's hope value objects will actually be useful in future!