mikepound / opencubes

A community improved version of the polycubes project!
MIT License

More buffers #39

datdenkikniet closed this 1 year ago

datdenkikniet commented 1 year ago

Add some write buffering to the Rust version to significantly speed up writes, and add an option to count the cubes in a file when converting stream-oriented files.

Also add a "prefill" option to the PCube header so that we can write a fixed-size header.
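For anyone unfamiliar with the buffering change, here is a minimal sketch of the idea (the function, buffer size, and data here are illustrative, not the actual code in this PR):

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

// Illustrative only: wrap the output file in a BufWriter so each cube write
// goes to an in-memory buffer instead of issuing a syscall per write.
fn write_cubes(path: &str, cubes: &[Vec<u8>]) -> std::io::Result<()> {
    let file = File::create(path)?;
    // 1 MiB buffer; even the default 8 KiB is a big win over unbuffered writes.
    let mut out = BufWriter::with_capacity(1 << 20, file);
    for cube in cubes {
        out.write_all(cube)?;
    }
    // Flush explicitly so write errors surface here instead of being swallowed on drop.
    out.flush()
}
```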

datdenkikniet commented 1 year ago

@NailLegProcessorDivide if you could take a look at this I'd appreciate it :) It makes some changes that are important for calculating larger sizes of N.

NailLegProcessorDivide commented 1 year ago

Is this mostly for getting ready to stream to files? I don't quite understand the LEB128 thing.

datdenkikniet commented 1 year ago

The leb128 change makes it possible to write a header of a predetermined length. Normally, leb128 lets you write smaller numbers in fewer bytes, but in this case we already know the theoretical limit (u64), and saving 9 bytes on a file of 600 GB is pointless.

Now we can write a stream of cubes to the file while keeping a running count, and set the count field to the correct value at the end. That means we don't have to re-write the entire file once we know the length (there is, AFAIK, no way to "shift a file to the right" without rewriting all the data following it). Not having to re-write and/or re-read the entire file, when we know we'll eventually have the final count anyway, is nice.

To illustrate:

Before:

  1. Write a header of length 1 (one byte indicating a cube count of 0) + N (other bytes), then 10123014 cubes
  2. Rewrite the entire file just to write a header of length M (the leb128 encoding of 10123014) + N

Now:

  1. Write a header of length 10 (a count of 0, padded to 10 bytes) + N (other bytes), then 10123014 cubes
  2. Rewrite the header in place to update the length field (still 10 bytes, possibly padded).
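A rough sketch of that second flow, assuming (purely for illustration) a header that is just some placeholder magic bytes followed by the 10-byte count field; the names, offsets, magic, and payloads here are made up and not the real .pcube layout:

```rust
use std::fs::File;
use std::io::{Seek, SeekFrom, Write};

// Encode `n` as LEB128 padded to exactly 10 bytes: the first nine 7-bit groups
// are written with the continuation bit set, the last one without. Any u64 fits.
fn leb128_fixed_10(mut n: u64) -> [u8; 10] {
    let mut out = [0u8; 10];
    for byte in out.iter_mut().take(9) {
        *byte = (n & 0x7f) as u8 | 0x80;
        n >>= 7;
    }
    out[9] = (n & 0x7f) as u8;
    out
}

fn main() -> std::io::Result<()> {
    // Made-up layout: 4 placeholder magic bytes, then the 10-byte count field.
    let mut file = File::create("cubes.pcube")?;
    file.write_all(b"PCUB")?;
    let count_offset = file.stream_position()?;
    file.write_all(&leb128_fixed_10(0))?; // prefilled, padded count of 0

    // Stream cube data while keeping a running count (dummy payloads here).
    let cubes: Vec<Vec<u8>> = vec![vec![0x11, 0x22], vec![0x33]];
    let mut count: u64 = 0;
    for cube in &cubes {
        file.write_all(cube)?;
        count += 1;
    }

    // Patch the count field in place: the header size never changes, so
    // nothing after it has to be rewritten.
    file.seek(SeekFrom::Start(count_offset))?;
    file.write_all(&leb128_fixed_10(count))?;
    Ok(())
}
```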

I'm not sure why the format uses leb128 (just a u128 or u64 would have avoided having to do it this messily), but :shrug:

This could technically trip up implementations whose read function isn't implemented (correctly!) the way we have it here, but there is nothing about leb128 encoding that says you can't do this (i.e. encode a number with a fixed number of bytes).
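To make that concrete, here is a sketch of a generic LEB128 decoder (not the project's actual read function): it keeps consuming 7-bit groups while the continuation bit is set, so zero groups used purely as padding decode transparently:

```rust
// Sketch of a generic LEB128 decoder. Overflow checks are omitted for brevity.
fn read_leb128_u64(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    let mut shift = 0;
    for (i, &b) in bytes.iter().enumerate() {
        value |= u64::from(b & 0x7f) << shift;
        if b & 0x80 == 0 {
            return Some((value, i + 1)); // value and number of bytes consumed
        }
        shift += 7;
    }
    None // ran out of input mid-number
}

fn main() {
    // The minimal encoding of 624 vs. the same value padded to 10 bytes.
    let minimal = [0xf0, 0x04];
    let padded = [0xf0, 0x84, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x00];
    assert_eq!(read_leb128_u64(&minimal), Some((624, 2)));
    assert_eq!(read_leb128_u64(&padded), Some((624, 10)));
}
```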

Main motivation is that I don't want to wait 2.5 hours just to see if the file I wrote for N = 16 thinks it contains 59795121480 polycubes.

It's especially relevant when generating new cubesets, since the whole point is that you can't keep them all in memory and there is no way to know how many there will be in advance.

datdenkikniet commented 1 year ago

Oh and the rest is for perf improvements (unbuffered writes to disk are slow) and bug fixes.

NailLegProcessorDivide commented 1 year ago

Ok yep, I hadn't realised the length field was variable size. That's a choice, given that a file anywhere close to a u64 of polycubes is unstorable anyway.

The rest of the changes look good to me.

datdenkikniet commented 1 year ago

@bertie2 I think this is ready for merge :) I have rebased the commits to clean up the history a bit (and would prefer a non-squash this time around).

I have also verified that the funky trick (ref) for getting a fixed-size header works fine with the leb128 python library, so I hope you're OK with it.

bertie2 commented 1 year ago

Apologies for the delay in merging this: partly life has gotten in the way, partly I have become fixated on methods for distributing the work. It all looks good, and it runs a good bit faster on my machine. Also, yes, the files still seem to be compatible; I guess the python library handles extraneous blocks of zeros in case another implementation uses them. I hope to get back to you with the V2 .pcube format asap, as I know the current limitations are proving annoying.

datdenkikniet commented 1 year ago

That sounds cool! I'd like to see that too, the way the DFS algorithm works seems to lend itself well to distribution :)

Also, no worries about delays, things happen. I'll gladly take any response, especially one as quick as 3 days! Thanks for merging.