seung-lab / connected-components-3d

Connected components on discrete and continuous multilabel 3D & 2D images. Handles 26, 18, and 6 connected variants; periodic boundaries (4, 8, & 6)
GNU Lesser General Public License v3.0

Parallel Operation #6

Open william-silversmith opened 5 years ago

william-silversmith commented 5 years ago

For large volumes, it would be nice to process them in parallel.

william-silversmith commented 5 years ago

The rendering phase is trivial to parallelize. There are a few ways to go about the labeling phase.

william-silversmith commented 5 years ago

This design is a little trickier than one might think. If you make a separate DisjointSet per thread, it incurs a large memory overhead unless we change the design. Experiments with std::unordered_map have not been promising, but it's likely I was using it wrong (several times slower, large memory usage).

william-silversmith commented 5 years ago

I think I have a good method for doing this in parallel, but it's so fast already... 70 MVx/sec/core isn't bad. cf. #12

matthew-walters commented 4 years ago

I've written a similar library in C# for voxel game physics (C# because I'm using Unity3d). I'm only using the 6-neighborhood in mine. I came across this library today; it's about 2.5 times as fast as mine. I'm currently in the process of redoing mine in C++ and importing it into C# as a DLL to see if it improves my speed.

I've given a bunch of thought to how to parallelize without sharing data between threads, and here are a few approaches I've come up with, in case you're interested. (I'm still grokking your source code, so I'm not certain how applicable these are to your algorithm.)

Of course, rather than xz planes and yDir, you can do xy planes and zDir.

william-silversmith commented 4 years ago

These are great approaches for 6-connected, which, despite being the simplest version, is also often the slowest because you can't exploit connections between the arms of the stencil like you can in 18- and 26-connected.

I think a similar thing could be done with blocks instead of planes while retaining generality for the other connectivities. The thing is, though, so far cc3d has been fast enough that parallelism hasn't been needed for me. What target time and volume size are you working with?

I find the biggest time suck in most of these 3D algorithms is the Z pass (assuming Fortran order), where cache misses cause the CPU to spend most of its time waiting for main memory. Parallelism probably helps with that.

One trick you may have overlooked, which I discovered by accident, is that the union-find data structure should use full path compression and no balancing by rank. This is because during the equivalence pass we don't really care how bulky the subtrees get so long as the union operation is fast. Then, when we apply find during the render phase, the subtrees get collapsed quickly by path compression and repeated evaluation becomes rapid.

This division of labor in the union-find data structure is particular to connected components, so as far as I know this trick is not generally applicable beyond this problem.
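
A minimal sketch of the idea in Python (just an illustration; the actual DisjointSet in cc3d is C++ and differs in the details):

```python
class UnionFind:
    # Union-find with full path compression and no balancing by rank.

    def __init__(self, size):
        self.parent = list(range(size))

    def find(self, x):
        # Full path compression: walk to the root, then point every
        # node on the path directly at it.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        # Equivalence pass: no rank bookkeeping, just attach one root to
        # the other. The trees may get bulky, but union stays cheap.
        self.parent[self.find(x)] = self.find(y)
```

During the render pass, calling find on each voxel's provisional label flattens the trees, so repeated lookups quickly become cheap.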

matthew-walters commented 4 years ago

I found that union-by-rank and union-by-size gave me no speedup, but I'm doing path compression each time I mark two labels as equivalent.

Right now, the largest game world I'm using is 512x512x64 (LxWxH). Probably in contrast to brain imaging data, most of the action is nearer to the ground; closer to max height it's mostly empty voxels (sky), except for the tops of skyscrapers, etc.

Your CCL algorithm (called as a DLL from C#) takes about 50 ms to complete for this size (mine, coded in C#, takes about 115 ms). Both are sufficiently quick when the level is loading (for one thing, it takes longer to render the voxels, independently of this calculation). But I'm now focused on deleting voxels and then recalculating the component(s) to determine two things:

1) has one component now become two (or more)? (Imagine a big letter "H"; if I delete the voxel in the middle, I now have a "|-" and a separate "-|".)

2) is the component no longer touching the ground?

Currently, for the dynamic part, I have a second version of the CCL algorithm where, instead of iterating over all voxels in the world, I iterate only over the voxels in the list belonging to the label of the voxel just deleted (minus the deleted voxel itself). I'd prefer to just recalculate the whole world every time, though, as then I wouldn't have to maintain two versions of the algorithm (static -- all voxels in the world; dynamic -- voxels in the particular component that just had a voxel deleted), plus it should lend itself more easily to cases where voxels are being deleted in many different components at once. But for this to be viable, I'd really like to get the calculation time under 16.66 ms since I'm trying to target 60 FPS.

william-silversmith commented 4 years ago

This is a really interesting use case.

For a single threaded application, a single linear scan of a uint32 512x512x64 volume takes about 26 msec; a uint16 volume takes 18 msec and a uint8 volume takes 8 msec on my MBP, but those types are too small to hold the maximum possible number of labels in the image. This leads me to believe, as you indicated, that parallelism is the only way to get 60 FPS using a generic algorithm. However, I suspect that you'll need to share data between threads because you must avoid a full single threaded scan of the volume even to render the final labels.

In terms of improved single threaded algorithms, it's possible to do better than mine using Block Based Decision Trees by Grana et al, but that team wrote their algorithm for 2D and the (somewhat hairy) logic would have to be extended to 3D.

william-silversmith commented 4 years ago

I was looking at the code again. I think it should be possible to simultaneously process blocks in a possibly lock-free manner by setting next_label to different offsets. The relabeling step can be accomplished in parallel using a reader-writer lock in DisjointSet.root, so long as it is acceptable for the final values in the array to not be contiguous.

william-silversmith commented 4 years ago

I'm a bit jazzed about this parallel implementation concept, but I suspect it won't help your use case. What are the minimum requirements for your users? Just for cc3d, it looks like to hit 60 FPS using perfect parallelization with no algorithm overhead you'd need 3 fully occupied cores (about 50 msec / 16.7 msec per frame). I'm gonna guess that with overhead you'll need 4 cores. The only way to make the problem tractable is to a) use a GPU implementation or b) shrink the target area (like you've done with your second implementation). If you can shrink the size of the world to be evaluated by a factor of 4, you can use a single threaded general implementation.

matthew-walters commented 4 years ago

I don't have any users right now, it's a personal project so far =)

I tried a compute shader in Unity3d but I found it too slow transferring the data back from the GPU to CPU. I'd much prefer a CPU solution anyways.

There are two ways I'm computing the labels for the CCL algorithm (independent of static vs. dynamic). If I let the CCL run against the standard x, y, z, then the whole level of nonempty voxels ends up as one component.

The alternative is to have a user-defined yFloor so that the ground below it is not considered for labeling during CCL, which results in, for example, every office building in the level being its own component instead of all office buildings being connected to each other through the ground and road.

When everything is separated, I can handle it easily. But for the one-label case, with a level that's say 512x64x512, when deleting a voxel it's slower to iterate through the list of voxels with that label (in this case all nonempty voxels in the level, at least for the first delete) than to just iterate through each of the three coordinates as in the standard CCL. Both are too slow for me right now to be usable. So I use the yFloor to separate the level into components and only recalculate the one component that had a voxel deleted.

Starting with the whole level as one label, which the CCL calculates normally, seems more natural to me. Otherwise, how could my project handle earthquakes? On the other hand, it would be nice not to worry about deletions and just recalculate every time.

I should probably separate the game world into 8 pieces, one octree level I guess, and keep track of when voxels in components in one piece are connected to a component in another piece.

william-silversmith commented 4 years ago

I'm a bit time constrained at the moment, but I'm going to try to implement the parallel version of this. It might take a few days or weeks before you see anything useful, though. I've gotten requests for parallel operation from people running enormous multi-GB scientific datasets in the past, so I'm sure they'll appreciate it too. Who knows, maybe this will be my excuse for attempting 3D Block Based Decision Trees, which would probably increase single core performance by ~1/3 (though it's hard to extrapolate the 2D numbers to 3D since the mathematical topology and memory latencies are so different).

william-silversmith commented 4 years ago

Dang. I was curious and did some reading and realized that I had been thinking (for some reason in contradiction to what I wrote in the README) that this library is in the same column as He's 2007 algorithm, but it really implements a 3D variation of Wu's 2005 algorithm. There's actually substantially more to be gained by improving the single core performance.

william-silversmith commented 3 years ago

@matthew-walters I'm not sure if you're still experimenting, but the six connected algorithm has gotten significantly faster in recent releases. I'd be interested to hear how it's doing for you.

There were two major improvements: (a) the use of the phantom label technique detailed in the README (worth up to a 1.5x speedup) and (b) an improvement to the relabeling phase worth up to 1.3x. That should hopefully reduce 50 msec to around 25 msec without the use of parallelism and would give you 40 FPS updates.

william-silversmith commented 3 years ago

~In a small test on random data I saw 28 msec (36 FPS) and on some of my structured connectomics data I saw 10 msec (100 FPS) for that size image.~ Rather, that was for 512x512x64. I saw worse results for 512x64x512 (46 msec, 21 FPS) on connectomics data.

matthew-walters commented 3 years ago

Thanks for the update. Looks great. I'm busy with some other projects right now, but I should get back to the voxel physics project in the new year :)

schlegelp commented 2 years ago

Hi Will! I'm currently running cc3d.connected_components on a (7062, 16714, 31019) uint8 on-disk array - i.e. something I don't want to hold in memory in its entirety.

My approach is to process the volume in chunks of 1000 x 1000 x 1000 voxels and distribute the chunks across multiple cores. Each process roughly does this:

  1. Load the chunk into memory.
  2. Run connected_components to extract labels.
  3. Wait for the previous chunk to finish and report the number of unique labels (i.e. an offset) encountered so far.
  4. Add that offset from prior processes to the labels.
  5. Update and report a new offset to the next process.
  6. Save the labels to an on-disk Zarr array.

The hand-over period in steps 3-5 slows things down a little but it's not too bad. The chunking also means that I will have to do some post-processing to "knit" labels at the seams between the chunks.
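
A rough sketch of what one chunk worker does, following the steps above (illustrative only; the Zarr paths and the wait_for_offset / report_offset hand-over helpers are placeholders, not my actual code):

```python
import cc3d
import zarr

def process_chunk(in_path, out_path, chunk_slices, wait_for_offset, report_offset):
    # 1. Load the chunk into memory.
    image = zarr.open(in_path, mode="r")[chunk_slices]

    # 2. Run connected_components to extract labels.
    labels = cc3d.connected_components(image)
    n_labels = int(labels.max())

    # 3. Wait for the previous chunk to report its running label count (the offset).
    offset = wait_for_offset()

    # 4. Add that offset to the foreground labels.
    labels[labels != 0] += offset

    # 5. Report the new offset to the next process.
    report_offset(offset + n_labels)

    # 6. Save the labels to the on-disk Zarr array.
    zarr.open(out_path, mode="r+")[chunk_slices] = labels
```

The seams between chunks still need to be knitted afterwards, as mentioned above.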

Some snags I encountered:

  1. It would be great if I could enforce the data type of the labels returned by cc3d.connected_components - otherwise I have to occasionally upcast (from uint16 to uint32 in my case), which is fairly expensive. I tried using the max_labels parameter but I didn't get it to always return uint32.
  2. Adding the offset can also be quite expensive. I was wondering if cc3d.connected_components could accept an offset parameter?

I'm sure you must also have processed large volumes in chunks at some point. I'm curious to hear what your strategy is.

william-silversmith commented 2 years ago

Hi Philipp,

I've thought about this a little and wrote my thoughts here: https://github.com/seung-lab/igneous/issues/109 However, no one has yet asked me to actually do this so I haven't implemented anything. I think Stephan Gerhardt may have taken a stab at it once, but I never heard how far he got.

The path I'd take would be similar to this scidb paper with maybe my own spin on it.

Oloso, A., Kuo, K.-S., Clune, T., Brown, P., Poliakov, A., Yu, H.: Implementing connected component labeling as a user defined operator for SciDB. In: 2016 IEEE International Conference on Big Data (Big Data), pp. 2948–2952. IEEE, Washington DC, USA (2016). https://doi.org/10.1109/BigData.2016.7840945

With respect to (1): older versions of cc3d did have that feature, before I had a method to be very confident about the minimum size necessary. Version 1.14.0 was the last one that supported it. I'll think about reintroducing it (and throwing an error if the provided out_dtype is predicted to be too small).

With respect to (2): by too expensive, do you mean in memory or time? If memory, I think the only problem there is incrementing everything except zero. I think you can use numpy += and fastremap.mask to mask zero. I'd rather not add another parameter for a task that can be achieved via composition.
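
Roughly the composition I mean, as a sketch (I'm writing the fastremap.mask call from memory, so double check the keyword arguments against the actual signature):

```python
import fastremap
import numpy as np

def add_offset(labels: np.ndarray, offset: int) -> np.ndarray:
    labels += offset  # shifts every label, including background 0 -> offset
    # Restore the background by zeroing out the value that used to be 0.
    # (Assumes fastremap.mask(arr, values, in_place=True) sets the listed
    # values to zero; adjust if the signature differs.)
    fastremap.mask(labels, [offset], in_place=True)
    return labels
```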

schlegelp commented 2 years ago

Haha glad I'm not the only mad person to think about that application.

Another thought: I don't know a thing about how the C++ code works under the hood, but would it be possible for connected_components to work on on-disk arrays (like Zarr or N5) for both in- and output? That would avoid having to chunk + knit the data.

Re 1) That'd be great, thanks!

Re 2) Time - I haven't even checked memory to be honest. I am effectively using a mask and numpy +=. It works fine as is but I thought it would be faster if cc3d just started with offset instead of 0.

william-silversmith commented 2 years ago

It's certainly something I think will start to see more use cases once we have an easy way to do it.

> Another thought: I don't know a thing about how the C++ code works under the hood, but would it be possible for connected_components to work on on-disk arrays (like Zarr or N5) for both in- and output? That would avoid having to chunk + knit the data.

This works to effectively reduce the memory usage of cc3d by removing the factor of the size of the input array and possibly the output array. However, the union-find array would still have to be held in memory. So the way to look at the strategy is that it would increase the size of arrays that could be processed on a single machine by a constant factor (approximately 2x). Removing the input file from memory can already be done by mmapping a single contiguous array. In order to make this actually low memory, we'd need to estimate the number of provisional labels needed for the UF array.

This estimation could be accomplished by computing the estimated provisional labels in each chunk and summing them. This will result in a higher estimate than if full rows are used, since chunk boundaries will count towards the estimate.
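
As an illustration of what a per-chunk estimate might look like (this is not what cc3d does internally), you could count the foreground runs along the scan axis, since at most one provisional label can be created per run:

```python
import numpy as np

def estimate_provisional_labels(chunk: np.ndarray) -> int:
    # Count positions where a nonzero voxel starts a new run along axis 0,
    # i.e. its predecessor along that axis is background or a different label.
    # Summing this over chunks upper-bounds the union-find allocation needed.
    foreground = chunk != 0
    starts = foreground.copy()
    starts[1:] &= (chunk[1:] != chunk[:-1])
    return int(starts.sum())
```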

I can kind of see the outlines of a chunk based strategy where you pass in image chunks in a predetermined order and the algorithm only needs to retain neighboring image rows. cc3d would then return not an output image, but the mapping you need to apply to the larger image.

It's an interesting idea. It's not easy to parallelize, though. For your 3.7 TB image, even if cc3d ran at full speed (~120 MVx/sec, typical for 26-connected connectomics images), it would take 8.4 hours of processing just to get the mappings (roughly 3.7 TVx / 120 MVx/sec ≈ 30,000 seconds). The chunking will slow it down some more, so probably more like 12-24 hours (maybe much more if any IO step is not optimized).

> Re 2) Time - I haven't even checked memory to be honest. I am effectively using a mask and numpy +=. It works fine as is but I thought it would be faster if cc3d just started with offset instead of 0.

You're right that it would be faster because we'd condense two operations into one, but I'd prefer not to do it if I can avoid it because some of the other functions in this library (e.g. cc3d.statistics) rely on the max label being <= the size of the array.

I think you may want to take another look at fastremap.mask. The += operator is probably running at something like 2 GVx/sec, so I doubt that's the bottleneck (it should be something like 0.5 seconds for a 1000x1000x1000 chunk). numpy masks are easy to write in a way that is very slow. Let me know if fastremap doesn't help. The algorithm used is kind of general, but it should be better than numpy. If it's still too slow, it would be possible to special-case masking a single label at very high speed in fastremap.

matthew-walters commented 2 years ago

> @matthew-walters I'm not sure if you're still experimenting, but the six connected algorithm has gotten significantly faster in recent releases. I'd be interested to hear how it's doing for you.
>
> There were two major improvements: (a) the use of the phantom label technique detailed in the README (worth up to a 1.5x speedup) and (b) an improvement to the relabeling phase worth up to 1.3x. That should hopefully reduce 50 msec to around 25 msec without the use of parallelism and would give you 40 FPS updates.

Greetings. So I've been playing around with both the previous version of this library I had been using and the latest version. I'm actually finding no difference outside the margin of error on my current PC, for a 512x512x64 volume that represents a city with buildings, etc.

If I treat the entire volume as connected (every building eventually connects to the ground either directly or indirectly), it takes about 29 ms with either version.

If I consider the "ground" plane not to be connected (such that each building, etc ends up with its own label as they're no longer connected to each other through the ground), it's slightly faster, around 26 ms with either version.

I'm timing specifically the invocation of `cc3d::connected_components3d<short, uint16_t>(labels, xSize, ySize, zSize, 6);`

Maybe it's the structure of my volume -- representing a city rather than medical imaging data -- that causes some bottleneck.

Here's a screenshot for reference. For illustration, I've colored everything according to its label (but I'm reusing colors for multiple labels). (Screenshot 2022-04-22 091922)

william-silversmith commented 2 years ago

Hi Matthew,

Thanks for following up! Your null result is a little surprising. Your data look similar enough in image statistics to the medical data that you should see a benefit. In my opinion, the main reason you wouldn't see a benefit would be that your data are smaller (16.7MB) than the ones I typically work with (512x512x512, 32 or 64 bit, 0.5-1 GB), so you might be getting much better cache performance that nullifies the benefits of the optimizations. In fact, the optimizations could be working against you at smaller sizes, since they perform additional (fast) passes on the image. You can try using the other version of cc3d::connected_components3d, which skips the automatic estimation pass. No guarantee you'll see a significant benefit, but maybe worth a try.

```cpp
uint16_t* out = cc3d::connected_components3d<short, uint16_t>(
  labels, xSize, ySize, zSize,
  /*max_labels=*/xSize*ySize*zSize, /*connectivity=*/6, /*out_labels=*/NULL
);
```

william-silversmith commented 2 years ago

Oh! One other thing you can try. Instead of generating a new out_labels each time, try pre-allocating an out_labels array and passing it in. You'll have to re-zero it after each pass, but at least you'll avoid allocation overhead.

william-silversmith commented 2 years ago

Okay, lastly: I do have a prototype of parallel operation working, but it is clunky code-wise and requires further tuning. It was slower than single threaded in my tests, but I haven't exhausted all options for improving performance.

matthew-walters commented 2 years ago

> Hi Matthew,
>
> Thanks for following up! Your null result is a little surprising. Your data look similar enough in image statistics to the medical data that you should see a benefit. In my opinion, the main reason you wouldn't see a benefit would be that your data are smaller (16.7MB) than the ones I typically work with (512x512x512, 32 or 64 bit, 0.5-1 GB), so you might be getting much better cache performance that nullifies the benefits of the optimizations. In fact, the optimizations could be working against you at smaller sizes, since they perform additional (fast) passes on the image. You can try using the other version of cc3d::connected_components3d, which skips the automatic estimation pass. No guarantee you'll see a significant benefit, but maybe worth a try.
>
> `uint16_t* out = cc3d::connected_components3d<short, uint16_t>(labels, xSize, ySize, zSize, /*max_labels=*/xSize*ySize*zSize, /*connectivity=*/6, /*out_labels=*/NULL);`

This has gotten me down to about 20 ms on average!

I've also taken the suggestion about pre-allocating the out_labels array and passing it in, but right now I'm focused on getting the first pass to be as quick as possible :)

william-silversmith commented 2 years ago

Yay! I'm glad that helped. Here's one more thing you can try:

You can adjust max_labels downwards if you have some idea of how many provisional labels are needed. A random noise dataset will need it to be xSize x ySize x zSize, but usually you need many fewer. Try cutting it by 2-10x. This will reduce the allocation of the union-find data structure. If you go too low (e.g. any reduction for a sequence of increasing integer labels), the algorithm will crash in a data-dependent fashion. The automatic pass you disabled makes a safe estimation of how much to allocate, but it was too expensive for your use case.