vigna / fastutil

fastutil extends the Java™ Collections Framework by providing type-specific maps, sets, lists and queues.
Apache License 2.0
1.78k stars 196 forks source link

Support for N-Byte-backing primitive collections. #289

Closed Z-Kris closed 1 year ago

Z-Kris commented 1 year ago

This is primarily an "issue" with maps however it could be applied to a variety of other collections, too.

Would love to see an expansion on the possible types. Right now we have the basics covered - bytes, shorts, ints, longs, however there are gaps here. In particular 3-byte, 5-7 byte and 9+ byte structures. I've found myself in scenarios where having e.g. an Int2MediumMap(medium being what Netty calls 3-byte integers) could've ended up shaving off over a hundred megabytes but due to the lack of support for one, been forced to stick to Int2Int maps and unnecessarily waste a byte... often in collections which are millions in size.

As for how to add support for this, I'd recommend the following: Currently, e.g. a Byte2IntMap ends up storing the keys in a byte array and the values in an int array, both of which are the same length. But what if they weren't? If you had a Byte2MediumMap, you could just have the values be 3x in length but be represented in bytes behind the scenes. Sure, this would mean you'd have to always do three array reads and writes whenever reading or writing to it, but that is a relatively low cost when memory is the issue you're trying to solve. These specialized collections would use the next-up primitive type to represent the values when getting/setting them. E.g. for a medium-based collection, you'd pass it in as an int. For a 5-7 byte collection, you'd pass it in as a long. In both cases however, there should ideally be a check to ensure there aren't any bits set out of boundaries of the given collection. As for the last one, variable-size, since there is no primitive that could support it, it should be represented by a byte array. It would indeed be producing garbage when passing/retrieving the values, however if the underlying storage is the thing you are trying to optimize, this would be a non-issue. I'm not entirely sure if the latter is even necessary, but as far as I can tell, it doesn't seem like that much work to implement it after the other types are done, I'm sure someone could find a use for it.

There are two small issues with this approach you might've already thought of: 1) It'd be unclear whether the values are signed or unsigned. IMO, the documentation should state that the values are always unsigned for these specialized collections, leaving it up to the developer themselves to sign it if that is what they actually need. I'm recommending unsigned as signing these values would require some extra operations to achieve - and if the developer does not want it signed, it'd just be wasted CPU time. 2) Other than a Medium for the 3-byte version, I've no idea what you'd call these other versions.

vigna commented 1 year ago

Such maps (3-byte) would allow three times less keys than all the other maps, introducing a lot of unpredictability. For the other case, even worse. I don't think that's something we want.

One possibility could be to implement n-bit keys and m-bit values using a LongArrayBitVector to store them. It could be even faster than reading byte by byte and under 64 bit the maps would allow the same number of keys (in fact, even more keys). One could even think of interleaving keys and values, thus reducing cache pressure, maybe in that case using a LongBigArrayBigVector.

Z-Kris commented 1 year ago

That's indeed a better solution, I didn't really think of the length problem, my individual collections never really get into the hundreds of millions in length so it wasn't something I had to ever consider.