pcube file format discussion

datdenkikniet commented 1 year ago

For (eventual) implementations that may support reading & writing cubes directly from disk to avoid having to store them in RAM, it would be really beneficial if we can write the count of cubes at the end of the file as well.

If writing the count to the end of the file is supported, cubes can be written to it in a streaming fashion and the count can be added at the end. If it's not supported, you have to rewrite the entire file from the beginning in order to fit the LEB128 encoding into the header.
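To illustrate why the header count cannot simply be patched in place once the stream is finished, here is a minimal sketch of unsigned LEB128 encoding. The function names are hypothetical and are not taken from the opencubes code; the point is only that the encoded length grows with the value, so a placeholder of zero occupies fewer bytes than a large final count.

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128 (7 data bits per byte)."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def decode_uleb128(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode one unsigned LEB128 value; return (value, offset after it)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


placeholder = encode_uleb128(0)          # what a streaming writer would emit up front
final_count = encode_uleb128(1_000_000)  # what it would want to store afterwards
print(len(placeholder), len(final_count))
```

Because the final count needs more bytes than the placeholder, everything after the header would have to shift, which is the full rewrite described above.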

Is there an easy way to support this?

I think we should definitely add a byte to the header that is just flags, with 1 bit explicitly reserved for increasing said header size (if we ever find more than 7 flags we need).
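A flag byte with one bit reserved for extension could look roughly like the sketch below. The choice of the top bit as the extension bit, the bit order, and the names `EXTENSION_BIT`, `encode_flags`, and `decode_flags` are all assumptions made for illustration; nothing here is part of the pcube format.

```python
EXTENSION_BIT = 0x80  # assumption: top bit set means another flag byte follows


def encode_flags(flags: list[bool]) -> bytes:
    """Pack flags 7 per byte, lowest bit first; the top bit marks continuation."""
    chunks = [flags[i:i + 7] for i in range(0, len(flags), 7)] or [[]]
    out = bytearray()
    for index, chunk in enumerate(chunks):
        byte = 0
        for bit, is_set in enumerate(chunk):
            if is_set:
                byte |= 1 << bit
        if index < len(chunks) - 1:
            byte |= EXTENSION_BIT  # more flag bytes follow this one
        out.append(byte)
    return bytes(out)


def decode_flags(data: bytes, offset: int = 0) -> tuple[list[bool], int]:
    """Read flag bytes until one has the extension bit clear."""
    flags: list[bool] = []
    while True:
        byte = data[offset]
        offset += 1
        flags.extend(bool(byte & (1 << bit)) for bit in range(7))
        if not byte & EXTENSION_BIT:
            return flags, offset
```

With this layout a file that uses seven or fewer flags pays for exactly one extra header byte, and a reader that understands the extension bit can skip over flag bytes it does not recognize.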

bertie2 commented 1 year ago

There is a proviso in the format that a cube_count of zero means an ongoing stream. I don't see how having the cube count at the end would help, as reading to the end to get the cube count would be slower than just reading each cube as it comes. Or is there a fast way to skip straight to the end of the file?

datdenkikniet commented 1 year ago

Yes, I'm aware of the streaming format.

There is no fast way to skip to the end of the file, but we could encode the length of a trailer and the type(s) of data that are in the trailer in the header.

The main purpose would be the case where the program is writing a very large file in a streaming fashion while keeping track of the count. Since the count is already known, it would save a bunch of effort if we could just tack it on to the end instead of requiring either an entire rewrite of the file, or requiring that whoever opens the file must be OK with not knowing the number of cubes in the file from the get-go. In my case, not knowing the number of cubes from the start makes my parallel implementation go wonky, and being able to skip reading the file just makes it easier.

I can understand if the complexity is not warranted though.

Still, I do think we should add a reserved header byte for future expansion opportunities! Even if we don't have it now, being able to retroactively add new features without breaking old files is a good idea, IMO.

datdenkikniet commented 1 year ago

Okay, I have realized that my specific use case of "putting the cube count at the end" may not be super useful. I have, however, come up with a different thing that definitely requires the same infrastructure: writing blocks of cubes of the same size. This would entail:

A bit flag in the header indicating this format (we don't have any reserved bits at the moment that we could use for this, so we need a flag-byte).
For every block: the xyz size, followed by a LEB128 number giving the number of cubes of that size that follow.

This would reduce the amount of data stored per cube by 3 bytes, and could increase the file density significantly.

This would also allow for far more efficient in-file deduplication.
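The block layout proposed in this comment can be sketched as follows, to make the size difference concrete. This is an illustration under stated assumptions, not the pcube writer: each cube is modelled as a `(shape, payload)` pair with one byte per dimension, the payload is treated as opaque bytes, and the helper names `write_per_cube` and `write_blocks` are hypothetical.

```python
from itertools import groupby


def encode_uleb128(value: int) -> bytes:
    """Unsigned LEB128: 7 data bits per byte, high bit set on all but the last."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def write_per_cube(cubes):
    """Baseline for comparison: every cube carries its own x, y, z bytes."""
    out = bytearray()
    for shape, payload in cubes:
        out += bytes(shape) + payload
    return bytes(out)


def write_blocks(cubes):
    """Proposed layout: one x, y, z header and a LEB128 count per block."""
    out = bytearray()
    # Sort on the shape only, so equal shapes become adjacent for groupby.
    ordered = sorted(cubes, key=lambda cube: cube[0])
    for shape, group in groupby(ordered, key=lambda cube: cube[0]):
        payloads = [payload for _, payload in group]
        out += bytes(shape) + encode_uleb128(len(payloads))
        for payload in payloads:
            out += payload
    return bytes(out)


# 1000 cubes of shape 2x2x3 with a made-up 2-byte payload each.
cubes = [((2, 2, 3), b"\x0f\x01")] * 1000
print(len(write_per_cube(cubes)), len(write_blocks(cubes)))
```

Sorting requires holding every cube in memory, which runs against the streaming goal discussed earlier in the thread. A streaming writer could instead start a new block whenever the shape changes, at the cost of knowing each block's count before writing its header.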

datdenkikniet commented 1 year ago

Closing in favor of #8

mikepound / opencubes

pcube file format discussion #23