datdenkikniet closed this 12 months ago
Interesting strategy, and it looks like a good potential saving. It doesn't change anything in your testing, but just as an FYI: size16 only affects opti-bit-set, not point-list.
6:15 hours into N=15 it ran out of memory (96 GB). So this doesn't seem like a terrible approach, but the hash function does seem to have some problems: I calculated 1039496296 unique expansions for N = 14, which is one item too few :/
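For illustration, here is a minimal sketch of how deduplicating by hash alone can lose an item: if two distinct canonical expansions collide under the hash, the second one is silently dropped and the final count comes out one too low. The `bad_hash` function and the string "expansions" below are entirely made up for demonstration; they are not the real hash or data format.

```rust
use std::collections::HashSet;

// Hypothetical 8-bit "hash", deliberately weak so that two distinct
// inputs collide ("ab" and "ba" sum to the same byte value).
fn bad_hash(s: &str) -> u8 {
    s.bytes().fold(0u8, |acc, b| acc.wrapping_add(b))
}

fn main() {
    // Three distinct "canonical expansions"; "ab" and "ba" collide.
    let expansions = ["ab", "ba", "cd"];

    // Deduplicating by storing only hashes, not the full items.
    let mut seen_hashes: HashSet<u8> = HashSet::new();
    let mut count = 0u64;
    for e in &expansions {
        if seen_hashes.insert(bad_hash(e)) {
            count += 1;
        }
    }

    // 3 distinct expansions, but only 2 are counted: one item too few.
    println!("{count}");
}
```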
One option could be to use buckets similar to (or better than) what my patch does for parallelising: generate one pcube output file per bucket, then reload the now-smaller files individually to de-duplicate them. I'm not sure what ratio of duplicates we're filtering, and it could cause a potentially large increase in the amount of data we're storing on disk.
Or something like storing a pcube file per low byte of the hash, and not fully deduplicating it on the first pass.
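A rough sketch of that low-byte idea, using in-memory buckets in place of the 256 pcube files (all names here are hypothetical; the real code would append serialized cubes to per-bucket files and deduplicate each file in a second pass). The key property is that identical expansions always hash to the same low byte, so per-bucket deduplication is globally correct:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

// Route an item to one of 256 buckets by the low byte of its hash.
fn low_byte<T: Hash>(item: &T) -> u8 {
    let mut h = DefaultHasher::new();
    item.hash(&mut h);
    (h.finish() & 0xff) as u8
}

fn main() {
    // Stand-in for the stream of expansions, duplicates included.
    let expansions = vec!["a", "b", "a", "c", "b"];

    // First pass: append every expansion to its bucket, no dedup yet.
    let mut buckets: Vec<Vec<&str>> = vec![Vec::new(); 256];
    for e in &expansions {
        buckets[low_byte(e) as usize].push(e);
    }

    // Second pass: each bucket is small enough to deduplicate on its own.
    let total: usize = buckets
        .iter()
        .map(|b| b.iter().collect::<HashSet<_>>().len())
        .sum();
    println!("{total}"); // 3 unique expansions
}
```

Since duplicates can never end up in different buckets, summing the per-bucket unique counts gives the global unique count, at the cost of temporarily storing the duplicate copies on disk.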
Yeah! Just storing a few extra copies of a specific canonical expansion and filtering them out later on sounds like a good alternative when it comes to memory usage (it will use a bunch more disk, though, but that should be fine).
@datdenkikniet can this be updated or closed?
I will close this until I get around to trying this out again :)
This is on top of #21.
Seeing whether getting the memory size down by never storing many full cubes, and instead streaming them to and from disk, is effective or not. I'm comparing to points-list with size16, since that seemed to have a pretty small footprint. It's important to see whether the tradeoff between memory usage and slowdown (I'm fairly sure this will be a tad slower) is worth it.
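The streaming idea boils down to something like the following minimal sketch: append each expansion to a file as it is produced, then stream the file back one record at a time, so only one record is ever in RAM. The file name and line-per-record format below are made up for illustration and are not the actual pcube format:

```rust
use std::env;
use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Write};

fn main() -> std::io::Result<()> {
    let path = env::temp_dir().join("expansions.pcube");

    // Producer side: write expansions out as they are generated,
    // instead of accumulating them in an in-memory collection.
    let mut writer = BufWriter::new(File::create(&path)?);
    for e in ["cube-1", "cube-2", "cube-3"] {
        writeln!(writer, "{e}")?;
    }
    writer.flush()?;

    // Consumer side: stream records back; only one line is held
    // in memory at a time.
    let reader = BufReader::new(File::open(&path)?);
    let count = reader.lines().count();
    println!("{count}"); // 3
    Ok(())
}
```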
Results so far (definitely far less memory used :D):
TODO: