Closed knutwannheden closed 7 years ago
Hi Knut,
This sounds useful. If you could create a PR which adds the additional node type for each of the factories then it sounds like a useful inclusion!
Hi Niall,
DefaultByteArrayNodeFactory
currently already returns one of six different implementations. This number would now go up to twelve. (Any more and I think it might make sense to generate the classes.) I am thinking of using the same naming scheme, but replace the prefix ByteArray
with SingleByte
. The alternative would be to have the prefix Byte
only. Sounds good to you?
The getIncomingEdge()
method has to return a CharSequence
. For this I will create a very simple implementation, but I am wondering if it would make sense to add a cache like used by Java's Character#valueOf(char)
method.
Would you want to also implement this for the DefaultCharArrayNodeFactory
?
Detail: I noticed the DefaultByteArrayNodeFactory
and its implementation classes refer to UTF-8 in the Javadoc. But AFAIK any code point beyond 0x9F needs two bytes when encoded using UTF-8. So I think referring to the ISO-8859-1 encoding would be more appropriate.
This has become quite a heavy proposal :/
I was originally thinking we might only have one new node implementation per type of NodeFactory, but I see now we'd need 12 new classes. It's a lot of code for a minor improvement.
I'm thinking that for a version 3.x of this project I might add support to store the trees on disk, or in off-heap memory. So this would reduce the need to optimize for every single byte of memory usage on the heap. Is the current memory usage a big problem, or do you think you could wait for off-heap/disk support in future?
I almost expected you to consider this change a bit heavyweight for what it achieves. That is perfectly understandable and not any problem for me, since I can easily supply my own NodeFactory as it is.
What you say with regards to off-heap memory sounds very interesting! Especially since the data I am currently indexing is already off-heap. I will be interested to see how you plan to organize the data in memory, because I have personally also already considered an off-heap implementation for RadixTree. Also, do you already have a target release date for 3.0?
If Java 9 comes with Value Objects (see e.g. https://www.sitepoint.com/what-java-might-one-day-look-like/) it will be interesting to see how many applications still actually benefit from off-heap storage.
So as discussed, I will close this issue for now. Let's hope value objects will actually be useful in future!
This request may also apply to other
NodeFactory
implementations, but so far I have only been using theByteArrayNodeFactory
andCharArrayNodeFactory
implementations.In my use case I build a tree for 1.5M+ mappings resulting in a RadixTree with 1.9M+ nodes. Of these nodes more than 33% have an incoming edge containing only a single character. Wrapping that single byte (or char) in an array incurs a rather big overhead: 4 bytes for the reference + 16 bytes for the array (12 object overhead, 1 byte payload, 3 bytes of padding). So a total of 20 bytes vs. a single byte (or two bytes for
CharArrayNodeFactory
). Depending on the padding required for the concrete node implementation I suppose one could save between 16 and 20 bytes of memory, when dealing with nodes with a single character in the incoming edge. For my particular tree that makes quite a difference.Since I can supply my own
NodeFactory
implementation I can of course extend the mentioned implementations accordingly. I was just wondering if this improvement may be of general interest.