snowmanam2 / SnowmanPolycubeGenerator

3D polycube generator written in C
MIT License

PolycubeGenerator discussion #1

Open JATothrim opened 1 year ago

JATothrim commented 1 year ago

That's about it: this is an issue for discussion of @snowmanam2's SnowmanPolycubeGenerator.

See the earlier discussion: https://github.com/mikepound/opencubes/issues/49#issue-1872820243

JATothrim commented 1 year ago

@snowmanam2

For the data race / locking stuff, I tracked the fairly cryptic ThreadSanitizer warnings down to an overflow of the input buffer in the new zlib input stream implementation. There was also unprotected access to the progress bar "total_input_index" variable, though ThreadSanitizer made that one really easy to find. Changes for that were pushed last night, though I know I should probably rework the locking logic regardless.

Some explanation: the thread pool itself actually has 2 buffers - an "output" buffer and a "write" buffer. When the "output" buffer fills, the pointers get swapped and the unlucky worker goes off to write out the data from the "write" buffer. While that worker is writing, the "output" buffer is free to accept data from the other workers. Because the buffers are equal in size, the hope is that the fresh "output" buffer won't fill before the "write" buffer has been written out.

The output lock is released once the output buffer / variables are free to modify, but the write lock is held to keep other threads from touching the write buffer. Moving the output lock release below the write lock release effectively blocks all threads after generation for the entire write process (a quick test for N=12 showed ~40% slower, some N=13 tests only ~10% slower). Not sure if moving the buffering into the workers will really help much.

There might be a way to do parallel compression. Pigz does just that by keeping the last 32 KB of the preceding block of uncompressed data per thread to provide compression context. If running big enough blocks, maybe the performance would be worth it? Then again, I can only imagine the complexity required to actually pull this off.

As a test of parallelizing the pack + write step, I made a very rough testing branch (test-pwrite) to see how far I can decouple the threads. I moved all of the pack + write work into the workers, ending in pwrite calls, so the only shared output state is the total count and the write offset. It seems to work pretty well, though output compression no longer works as a temporary side effect. Even if this isn't the way to go, it's still probably best to keep the packing / serialization work in the workers.

Above for reference.

The ThreadPool struct is becoming quite cryptic to read... I did eventually figure out that there was double-buffering hidden in the code. :-) The unlock moving was a temporary hack to check whether it fixed the data race.

The test-pwrite branch now scales extremely well, to nearly 1600% CPU usage:

./polycube_generator 14 -t 16 -o cubes14.pcube
Starting file writer in PCube mode.
Using thread pool with 1 threads to generate n=9 from n=2
48311 intermediate polycubes found of length 9
Using thread pool with 16 threads to generate n=14 from n=9
1039496297 polycubes found of length 14                      
209 seconds elapsed

There is a bug in polycube_generator.c:188: the local variable goes out of scope while it is still referenced by the thread pool. Fixing that allows the above command to run.