mikepound / opencubes

A community improved version of the polycubes project!
MIT License
44 stars 23 forks

C++ implementation #7

Closed nsch0e closed 1 year ago

nsch0e commented 1 year ago

This implementation is completely separate from the Python implementation. It uses a simpler representation than RLE: just a list of the indices of all ones (like .nonzero() in numpy).
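
For illustration, a minimal sketch (not the actual code in this PR; the names are illustrative) of such a coordinate-list representation, where a polycube is just the list of its occupied voxel coordinates:

// sketch only: a polycube stored as a list of occupied voxel coordinates
#include <cstdint>
#include <cstdio>
#include <vector>

struct XYZ { int8_t x, y, z; };

int main() {
    // An L-shaped tetromino as a coordinate list (like numpy's .nonzero() output).
    std::vector<XYZ> cube = {{0,0,0}, {1,0,0}, {2,0,0}, {2,1,0}};
    std::printf("polycube with %zu voxels\n", cube.size());
}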

bertie2 commented 1 year ago

Holy hell this is fast, both in how quickly you've put it together and in runtime. Very nice. Unfortunately it still seems to slam into a memory wall at n=13, but much less so than the Python version. When would you be happy for full feedback and review on this? I don't want to spam you since this PR is still marked as draft.

nsch0e commented 1 year ago

I'm not sure what the process should be. There are things I would like to work on:

I'm open to feedback. I only selected Draft because the code isn't commented or well structured yet; it's just what I wrote down as I went.

NobodyForNothing commented 1 year ago

As memory seems to be the main issue, it would be optimal to only hold the cubes in memory that are currently needed. The way I see it, you don't need a cube after expanding it.

bertie2 commented 1 year ago

Happy to merge this after some cleanup and commenting. Would you rather I go through and point things out line by line in the review, or just make the changes and pull-request them into your repo?

This will not replace the Python implementation; it will sit alongside it.

nsch0e commented 1 year ago

@NobodyForNothing I'm really not sure where the line is for the memory footprint. At some point the problem will get big regardless. We can try to be efficient, but for "removing cubes after expanding" to pay off we have to consider other data structures, so that removing single items actually decreases the memory in use (e.g. a deque?). Perhaps a pool for all XYZ structs would be more efficient, because at iteration N there are exactly N XYZ structs in a Cube.
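
To illustrate the pool idea (purely a sketch, not code from this PR; CubePool and its layout are hypothetical): since every Cube at iteration N holds exactly N coordinates, all cubes could share one flat vector and be addressed by index, avoiding a separate allocation per Cube.

// sketch of a flat pool: cube i occupies coords[i*n .. i*n + n - 1]
#include <cstdint>
#include <cstdio>
#include <vector>

struct XYZ { int8_t x, y, z; };

struct CubePool {
    size_t n;                    // coordinates per cube at this iteration
    std::vector<XYZ> coords;     // all cubes packed back to back

    size_t addCube(const XYZ* c) {
        coords.insert(coords.end(), c, c + n);
        return coords.size() / n - 1;          // index of the new cube
    }
    const XYZ* cube(size_t i) const { return &coords[i * n]; }
};

int main() {
    CubePool pool{3, {}};
    XYZ tromino[3] = {{0,0,0}, {1,0,0}, {2,0,0}};   // straight tromino
    size_t id = pool.addCube(tromino);
    std::printf("cube %zu, first voxel x=%d\n", id, pool.cube(id)[0].x);
}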

@bertie2 I think pull-requests to my repo are more efficient than line by line discussion.

mikepound commented 1 year ago

What's the current memory footprint e.g. at n=13? I have some servers available ;)

Some of the discussions elsewhere are about how one could shard the work into different processes, or distribute it across machines. To pull that off we'd probably need a way of identifying the minimal set of unique cubes from a set of sets of cubes produced on different machines. There's almost certainly a known algorithm for this from big data - maybe future work ;)

nsch0e commented 1 year ago

What's the current memory footprint e.g. at n=13? I have some servers available ;)

Hard to say :) I only have 16 GB, so I can't test everything:

N    Single thread    4 threads         20 threads
12   2.5 GiB                            10.7 GiB
13   9 GiB at 10%     10 GiB at 10%                  (the base set from N=12 is already 2.5 GiB)

Memory seems to grow linearly with progress...

mpitof commented 1 year ago

For what it's worth, I've run it single-threaded on my MacBook Pro with 32 GB memory, and N = 13 took ~12 hours (commit cf6e413).

N = 13 || generating new cubes from 18598427 base cubes.
  took 43899.52 s
  num cubes: 138462649

It's currently calculating N = 14 and consumes about 30 GB of memory, though that's rapidly increasing, so I suspect it'll hit a wall soon.

N = 14 || generating new cubes from 138462649 base cubes.
   2%   928 it/s, remaining: 145890s
nsch0e commented 1 year ago

So I updated the code to use synchronized access to a single set, instead of multiple sets, when more threads are used. The memory footprint should now be more or less the same and not depend on the number of threads. Since all threads use the same resource (the set), they block each other while accessing it; this will be the limiting factor for speed when using many threads.
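
Roughly what that looks like (a minimal standalone sketch, not the actual code in the PR): one hash set shared by all workers and guarded by a single mutex, so memory no longer scales with the thread count but insertion serializes.

// sketch: a single shared set guarded by one mutex
#include <cstdio>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_set>
#include <vector>

int main() {
    std::unordered_set<std::string> seen;   // canonical cube encodings (placeholder type)
    std::mutex seenMutex;

    auto worker = [&] {
        for (int i = 0; i < 1000; ++i) {
            std::string candidate = "cube-" + std::to_string(i);  // stand-in for a canonical form
            std::lock_guard<std::mutex> lock(seenMutex);          // all threads contend here
            seen.insert(candidate);
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();

    std::printf("unique entries: %zu\n", seen.size());
}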

bertie2 commented 1 year ago

The open pull request from me at https://github.com/nsch0e/opencubes/pull/1 gets us pretty close to single-threaded performance, both in terms of memory usage and speed, while running multi-threaded.

@mikepound n=13 is easily doable with 32 GB of RAM, and we seem to have the same growth pattern as the file sizes, ~7x each time. Interestingly, in my testing the program works REALLY WELL with a swap file (at least on my NVMe SSD and at n=13), so if you have a server with 32-64 GB of RAM and a fast NVMe SSD, you should be able to allocate a multi-hundred-GB swap file and push n very high for only a fairly minor slowdown. Performance craters once swap usage gets >= RAM; it still continues at 800 base cubes / thread x 16 threads at n=14, but much slower.

Running the numbers up to n=15 myself now with 32 GB of RAM and 256 GB of swap; n=13 is taking less than 10 minutes.

nsch0e commented 1 year ago

It is to be expected that the "performance" in baseCubes/s is lower when N gets bigger, because each baseCube has more potential children, so many more new cubes can be found per baseCube. I'm again trying to find a way to use multiple sets depending on shape; this could reduce how much the threads block each other.
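
A hypothetical sketch of that idea (the names and the Shape key are illustrative, not the PR's code): shard the hash set by bounding-box shape so each shape has its own mutex, and threads only contend when they hit the same shape.

// sketch: one set per shape (bounding box), each with its own lock (C++17)
#include <cstdio>
#include <map>
#include <mutex>
#include <string>
#include <tuple>
#include <unordered_set>

using Shape = std::tuple<int, int, int>;          // bounding-box dimensions, e.g. sorted x<=y<=z

struct Shard {
    std::mutex mtx;
    std::unordered_set<std::string> cubes;        // canonical encodings (placeholder type)
};

std::map<Shape, Shard> shards;
std::mutex shardsMutex;                           // only taken to find or create a shard

void insertCube(const Shape& shape, const std::string& encoding) {
    Shard* shard;
    {
        std::lock_guard<std::mutex> lock(shardsMutex);
        shard = &shards.try_emplace(shape).first->second;   // construct shard in place if missing
    }
    std::lock_guard<std::mutex> lock(shard->mtx); // contention is now per shape, not global
    shard->cubes.insert(encoding);
}

int main() {
    insertCube({1, 2, 3}, "example-encoding");
    std::printf("shards: %zu\n", shards.size());
}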

VladimirFokow commented 1 year ago

Thanks to https://github.com/mikepound/opencubes/issues/5#issue-1802085753 for finding it; this paper really gives a lot of useful info: http://kevingong.com/Polyominoes/ParallelPoly.html.

For what it's worth, here are some of my conclusions & thoughts from it:

Do we really want to store all cubes of the record-breaking n, or is just calculating the number enough? If the latter, we could try to get clever and split the polycubes into non-intersecting sets, to save on RAM.

One such differentiating metric could be the shape (rotated to some canonical form, e.g. increasing dimensions, ensuring it's not mirrored), but we don't necessarily have to use shape, and we could use several metrics. However, if we can't predict in advance which metrics the cubes derived at n=16 from a given cube at n=14 would fall under, then using several metrics wouldn't help; we can just use one: the modulo of some hashing algorithm (any algorithm different from the one used for putting the elements into the set). We can change the modulus depending on how much RAM we have, to control the number of such groups and thus the size of each set. For the example below, let's say the modulus k = 1000.
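
As an illustration of the hash-modulo grouping (a sketch only; the hash function and k are placeholders, and FNV-1a is just one choice of a second, independent hash):

// sketch: a second, independent hash taken modulo k decides which "pass" a polycube belongs to,
// so each pass only ever holds roughly 1/k of the cubes in RAM
#include <cstdint>
#include <cstdio>
#include <string>

uint64_t fnv1a(const std::string& data) {
    uint64_t h = 14695981039346656037ULL;       // FNV-1a 64-bit offset basis
    for (unsigned char c : data) {
        h ^= c;
        h *= 1099511628211ULL;                  // FNV-1a 64-bit prime
    }
    return h;
}

int main() {
    const uint64_t k = 1000;                    // number of non-intersecting groups
    std::string canonical = "example-canonical-encoding";

    uint64_t group = fnv1a(canonical) % k;      // keep the cube only if this matches
                                                // the remainder of the current pass
    std::printf("this polycube belongs to group %llu of %llu\n",
                (unsigned long long)group, (unsigned long long)k);
}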

Also, we can pre-compute the cache only up to a certain level n (the maximum our storage capacity allows), and then use this cache as the starting point of every run, without caching the subsequent layers. Their information will be lost after the algorithm finishes. Why not cache the next layers? Because their size in MB would be ridiculous (more than we can hold), while the numbers of polycubes will be calculated in parallel on multiple machines, so this giant set never actually has to be stored in one place at the same time. We can compute the count of unique polycubes, but not store them.


Example:

We want to calculate the number for n=17. The maximum cache that we have the capacity to store is for n=14. This is our starting cache.

So the workflow can be as follows:

distribute the starting cache across several machines, e.g. each of the 5 machines receives a random 20% portion.
for each possible modulo remainder (in our case k=1000, see above, i.e. for each number from 0 to 999):
    on each machine:
        current_set = empty_set()
        for each subset of cubes of the local portion of the starting cache (each machine should know its appropriate subset size, so as not to overwhelm the RAM in the next step; the subset size can even be 1):
            calculate all canonical representations of polycubes with n=17 for the cubes of the current subset (this will take several steps, since the starting point is only n=14)
            filter these polycubes: apply some hashing algorithm and calculate the modulo remainder, throwing away everything that doesn't match the modulo remainder of the current loop
            add the polycubes into current_set

    // after all sets on all machines have been calculated:
    merge all sets from all machines into a resulting set
    count the number of elements in this resulting set
    save this integer somewhere (can also save the current modulo remainder with it, for logging)

add up all these integers

If the time is too long:

Maybe some further improvements can be made, with ideas both from the paper and from people here. Or maybe some things here are impractical or outdated. At least, this is the edge of my understanding right now. Thank you for reading!

nsch0e commented 1 year ago

@VladimirFokow thanks for your thoughts. I can't say I understood them completely, but I can describe the current state of the C++ implementation:

(in the following, "shape" means the bounding box of a polycube)

bertie2 commented 1 year ago

@nsch0e can I get your thumbs up to merge this into main as it stands? I'm happy with it as it is, but just want to check you don't have changes in flight.

nsch0e commented 1 year ago

@bertie2 I just made my last 3 small changes and am also happy with this state. Further improvements will have to wait. I have ideas, but I think we should provide this as a starting point for others.

mikepound commented 1 year ago

This is looking great! For info, I have access to a 50-core server with 1/2 TB of RAM - so if we need some testing doing.. ;)

mikepound commented 1 year ago

FYI I got a nice tweet from @tjol about his implementation here:

https://github.com/tjol/polycubes

I'll point him in this direction.

bertie2 commented 1 year ago

@mikepound if growth continues as we expect, then with 2 TB of RAM n=16 should be runnable with both this and tjol's implementation as they stand right now (just make sure you also have 2 TB of disk space to write out the result to); however, n=17 won't be.

The next step to get any higher will be to implement direct disk streaming, which I am working on in my own branch, but I want to get this and the Rust implementation merged first as they stand, so that others have something compatible to work from.

(I estimate you need 4-6 TB of space for n = 17 so it pretty much has to be disk streaming)

mikepound commented 1 year ago

Yes, no rush! I also note that the enumeration has actually now been computed up to n=18, according to Wikipedia. Clearly my video prompted someone to update the page! Enumeration and generation are two very different things, though.

tjol commented 1 year ago

I wouldn't mind finding out how much RAM my version actually uses for N>14, and whether it's limited by disk speed (writing manageable chunks of the result to disk one by one is all well and good, but my method of deduplication, looping through the entire file once for every new block of data, is actually quite slow, who knew).

JATothrim commented 1 year ago

I have queued some pull requests on @nsch0e's cpp branch:

I would like to know whether they are still candidates to be pulled.

The optimizations had the following impact:

./build/cubes -n 12 -t 2  245.05s user 1.88s system 180% cpu 2:16.76 total
vs.
./build/cubes -n 12 -t 2  297.87s user 2.04s system 180% cpu 2:46.07 total

bertie2 commented 1 year ago

@JATothrim if you raise your changes as pull requests into this repo I will take a look. I think I would merge 10 on the spot, but I would want feedback from @nsch0e before merging 11.

nsch0e commented 1 year ago

I like both. :)

Also, please feel free to improve the C++ implementation in this repo as you see fit. I don't think I can (or want to) approve everything, as there are far more knowledgeable people than me when it comes to C++. ;)

bertie2 commented 11 months ago

This is looking great! For info, I have access to a 50-core server with 1/2 TB of RAM - so if we need some testing doing.. ;)

@mikepound, not sure where to put this so replying here: the current latest C++ implementation with the split cache files should be able to produce all polycubes up to n=16 with 2 TB of RAM in about 2 days, as long as it also has ~3 TB of storage available for the results. Table of estimates here:

N    Peak Memory Usage (GB)   Total Output Size (GB)   Growth Factor
10   x                        0.01
11   x                        0.08                     8
12   x                        0.638                    7.975
13   6                        5                        7.836
14   27                       40.7                     8.14
15   216                      325.6
16   1728                     2604.8
17   13824                    20838.4
18   110592                   166707.2
19   884736                   1333657.6

The data for n=15 and above is estimated; the data for n=10 to n=14 is measured. An x indicates I couldn't properly measure the memory usage.

Whether this is actually worth running is up to you; above n=16 it is pretty much required that we move to an enumeration technique rather than trying to actually store the cubes.

Currently only the Rust version can do storage-less enumeration, and https://github.com/datdenkikniet has run it up to n=16. Unfortunately, using similar extrapolation, it would currently take about a year to beat n=18, though possibly faster on your hardware:

N    Time Taken (min)   Growth Factor       Estimated Time (days)
13   1.5
14   13                 8.66666666666667
15   110                8.46153846153846
16   970                8.81818181818182
17   7760                                   5.38888888888889
18   62080                                  43.1111111111111
19   496640                                 344.888888888889
20   3973120                                2759.11111111111

Again, we need to decide whether it's worth starting now or whether we should wait for code improvements; for now the breakthroughs seem to have distinctly slowed down.

mikepound commented 11 months ago

This is already an amazing improvement! :) I'd seriously consider running n=16, if only so we can say we did it! 3 TB is a lot, but also not totally out of the question. The RAM is currently an issue: we have 1/2 TB on our largest server, so we may need to wait until we get a few more optimisations, though a 75% memory saving may not count as "a few optimisations". I can certainly enumerate n=17, but if we are confident the code is working, all we can do there is verify the correctness of the paper.

I'd argue that enumeration and generation are two separate and entirely worthwhile goals. I.e., the fact that cubes have been enumerated up to 19 doesn't mean we shouldn't generate a file of cubes up to n=16 (or 17 if we get really lucky with optimisations). Then we stick it on a web app that randomly serves you one of the billions of shapes each time you visit it - next video!

JATothrim commented 11 months ago

The split cache files should be somewhat compressible with just zlib. This would reduce the disk space needed.

I'm working on "cube swapper" system that moves the cube XYZ list representation data out from RAM and instead stores it in disk via memory mapped file. The data is read in only when needed. A draw back is that this system is currently very slow as flushing the cubes out to disk periodically takes a long time. As proof-of-concept stage it does work. Used ram for computing the polycubes is minuscule compared to current situation: N=13 run going at 52/66 shapes, the process is showing less than 300MiB RSS. :-)
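
For illustration only (this is not the actual cube-swapper code, just a minimal POSIX/Linux sketch of the memory-mapped-file idea): the XYZ records live in a file mapping, so the OS pages them in on access and they can be flushed out of RAM explicitly.

// sketch: backing a block of XYZ records with a memory-mapped file
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

struct XYZ { int8_t x, y, z, pad; };

int main() {
    const size_t count = 1'000'000;              // number of XYZ records to back on disk
    const size_t bytes = count * sizeof(XYZ);

    int fd = open("cubes.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, (off_t)bytes) != 0) return 1;   // size the backing file

    // Map the file; pages are only read from disk when actually touched.
    auto* data = static_cast<XYZ*>(
        mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (data == MAP_FAILED) return 1;

    data[0] = XYZ{1, 2, 3, 0};                   // writes go to the page cache...
    msync(data, bytes, MS_SYNC);                 // ...and are flushed explicitly here

    munmap(data, bytes);
    close(fd);
    std::printf("stored %zu XYZ records via mmap\n", count);
}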

I noticed @nsch0e has done some experiments with compressing the cube representation, and I don't know about his plans for it.

nsch0e commented 11 months ago

Unfortunately life is getting in the way... :)

For what it's worth, I have pushed my WIP on compression to this branch. The idea is to encode pcubes as a string of direction changes. Since pcubes can have a tree-like structure, jumps also have to be encoded. Currently there is no spec describing the encoding, so it is perhaps a bit difficult to understand, but if @JATothrim wants to have a look, feel free to proceed with my draft.

In this branch the encoding is used in the hash set to save memory. Compression on disk could reduce the size to approximately 20% (5x). Each voxel is represented by a nibble, plus jumps when needed (mostly also a nibble, sometimes two for big jumps).

I don't think I will have much progress in the next month, so don't count on me...

nsch0e commented 11 months ago

so I tried to describe the compression:

PCube Compression

This compression uses the fact that polycubes have to be connected, so most of the time we only have to specify a direction for the next cube in the polycube. Only when a string of cubes ends do we use a jump instruction to begin the next branch of cubes.

Encoding:

+-----------------+-----------+-----------+---------------------+
| 8 bits          | 4 bits    | 4 bits    | ...2*length nibbles |
+-----------------+-----------+-----------+---------------------+
| length in bytes | [dir|jmp] | [dir|jmp] | ...                 |
+-----------------+-----------+-----------+---------------------+

Nibbles

dir

0b0xxx has its highest bit fixed at zero. The lower 3 bits are an index into the direction table:

table = [
  { 0,  0,  1},
  { 0,  0, -1},
  { 0,  1,  0}, 
  { 0, -1,  0},
  { 1,  0,  0},
  {-1,  0,  0}
]

jmp

0b1xxx has its highest bit fixed at one. The lower 3 bits are the reverse jump distance minus 1, so the jump 0b1000 references the second-to-last encoded cube. jmp instructions can be concatenated to represent bigger jumps: each following jump instruction shifts the previous jump distance by 3 bits. Example:

encoded jump:  1001 1010
jump distance:  001  010 = 0b1010

When a polycube is encoded by an odd number of nibbles, a jump instruction 0b1000 can be appended at the end as padding, because it doesn't add a cube.
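
A small sketch of walking such a nibble stream (illustrative only, and it assumes the high nibble of each byte comes first, which the actual code may not do): split each byte after the length byte into two nibbles and classify them as dir or jmp.

// sketch: iterate the nibbles of an encoded pcube and classify dir vs. jmp
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical encoded buffer: length byte first, then 2*length nibbles.
    std::vector<uint8_t> buf = {0x02, 0x24, 0x98};   // length = 2 bytes -> 4 nibbles

    size_t length = buf[0];
    for (size_t i = 0; i < length; ++i) {
        uint8_t byte = buf[1 + i];
        // assumption: high nibble first within each byte
        for (uint8_t nibble : {uint8_t(byte >> 4), uint8_t(byte & 0x0F)}) {
            if (nibble & 0x8)                        // 0b1xxx -> jump instruction
                std::printf("jmp, stored distance bits = %d\n", nibble & 0x7);
            else                                     // 0b0xxx -> direction index
                std::printf("dir, table index = %d\n", nibble & 0x7);
        }
    }
}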

JATothrim commented 11 months ago

@nsch0e Thank you for describing the compression scheme. :)