Parallelization of computing-intensive functions

nmandery commented 5 years ago

I do not know if this an issue there is any interest in, but I noticed the library can potentially benefit from parallelizing parts of computing-intensive functions.

I experimented a bit and used OpenMP to simply parallelize the check if a hexagon is contained in a polygon in the polyfill function: https://github.com/nmandery/h3/commit/65afcb16579e833083feba713a89d7423611212a

Results without openmp:

$ ./bin/benchmarkPolyfill 
    -- polyfillSF: 5830.508514 microseconds per iteration (500 iterations)
    -- polyfillAlameda: 7501.964168 microseconds per iteration (500 iterations)
    -- polyfillSouthernExpansion: 325676.916500 microseconds per iteration (10 iterations)
$ ./bin/benchmarkPolyfill 
    -- polyfillSF: 5241.980240 microseconds per iteration (500 iterations)
    -- polyfillAlameda: 7528.802096 microseconds per iteration (500 iterations)
    -- polyfillSouthernExpansion: 326040.845100 microseconds per iteration (10 iterations)

Results with openmp:

$ ./bin/benchmarkPolyfill 
    -- polyfillSF: 2136.395920 microseconds per iteration (500 iterations)
    -- polyfillAlameda: 2469.901036 microseconds per iteration (500 iterations)
    -- polyfillSouthernExpansion: 98592.296500 microseconds per iteration (10 iterations)
$ ./bin/benchmarkPolyfill 
    -- polyfillSF: 2006.864104 microseconds per iteration (500 iterations)
    -- polyfillAlameda: 2428.976026 microseconds per iteration (500 iterations)
    -- polyfillSouthernExpansion: 97370.778000 microseconds per iteration (10 iterations)

These are the specs of the machine I used:

$ grep 'model name' /proc/cpuinfo 
model name  : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
model name  : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
model name  : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
model name  : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

I guess the usage of OpenMP should be optional, as I do not know how this would affect the emscripten based Javascript port of h3.

dfellis commented 5 years ago

Looks like a 2-3x speedup. Very cool. :)

@joegilley it looks like you have a target. ;)

@nmandery we have known that the current polyfill algorithm is suboptimal -- it was just the easiest to prove correct under all circumstances.

There was a much better polyfill algorithm implemented in a prototype a few years ago (by @joegilley :) ), and with @nrabinowitz having implemented a line algorithm a month or so ago, it should be possible to implement the better polyfill algorithm for most situations on top of it.

I will leave it up to @isaacbrodsky if H3 will have an optional dependency on OpenMP. Likely the current algorithm will remain a fallback so a speedup could be nice to keep around, but this breaks the current assumption that the library will only burn a single core, which can affect capacity planning.

Also, since polyfill was designed to make sure contiguous polygons do not share any H3 indexes, it is possible for the performance-minded to "carve up" the polygon into sub-polygons and execute polyfill on each of them in their threading library of choice and see a similar speedup, though that is not nearly as turnkey as this.

nrabinowitz commented 5 years ago

There was a much better polyfill algorithm implemented in a prototype a few years ago (by @joegilley :) ), and with @nrabinowitz having implemented a line algorithm a month or so ago, it should be possible to implement the better polyfill algorithm for most situations on top of it.

It's worth noting though:

It's not yet clear whether the new algo will work effectively outside a single base cell
It definitely won't work across > 2 base cells
And we still need the point-in-poly check regardless, because the old algo was only convex hull.

So this is probably still valuable, and I really like how pluggable it is. But I definitely think this needs to be a build-time option (not just dependent on environment, but on an explicit flag), and I'm fairly certain it won't work at all with the Emscripten-compiled code.

dfellis commented 5 years ago

There was a much better polyfill algorithm implemented in a prototype a few years ago (by @joegilley :) ), and with @nrabinowitz having implemented a line algorithm a month or so ago, it should be possible to implement the better polyfill algorithm for most situations on top of it.

It's worth noting though:
* It's not yet clear whether the new algo will work effectively outside a single base cell

* It definitely won't work across > 2 base cells

* And we still need the point-in-poly check regardless, because the old algo was only convex hull.
So this is probably still valuable, and I really like how pluggable it is. But I definitely think this needs to be a build-time option (not just dependent on environment, but on an explicit flag), and I'm fairly certain it won't work at all with the Emscripten-compiled code.

...Likely the current algorithm will remain a fallback...

I know @nrabinowitz :) I also think your proposal that it's a compile-time flag to include it or not is probably the right way to go, but it does present issues for the Java, Go, and Python bindings:

Which version of H3 should they include, or should they include both somehow and let the user choose?
What if users only want half of their cores potentially utilized, or n - 1 of their cores?
Should the Java and Go bindings not include it but reimplement the perf boost in the native threading capabilities of those languages?

When we go from 1 core basic C functions to a multicore approach, lots of these thorny questions pop up that very much depend on the use-case and resource constraints of the user.

nmandery commented 5 years ago

@dfellis OpenMP allows setting the number of threads C functions like omp_set_num_threads or even using the environment variable OPENMP_NUM_THREADS. There are also a few more variables to control its behavior. This whole thing is a bit special and should definitely be a compile time option.

I also do not know if it really makes sense to include into h3 itself, it is just a patch I am using which helps me for my niche usecase where I need polyfill a lot.

dfellis commented 5 years ago

@nmandery your use-case is totally valid and we should try to support it! I'm just trying to bring up as many issues and alternatives as possible so the final result covers as many use-cases as possible while ideally also causing us the least maintenance headaches. :)

isaacbrodsky commented 5 years ago

Hi @nmandery, thanks for posting this! Great to see the speed up.

I don't think we want to merge this into uber/h3 right now, under the assumption we want to keep the C sources as dependency-less, (although this is pretty self contained within the library.) and because of the binding issue (such as with Emscripten) mentioned above.

nmandery commented 5 years ago

@isaacbrodsky That is perfectly understandable and makes sense to keep number of dependencies the library as low as possible. My intention with this issue was just sharing my findings.

For my part I am fine with compiling H3 with this patch when I need it.

dfellis commented 5 years ago

@isaacbrodsky does that mean the focus will be on implementing the improved polyfill algo to get a similar speedup in a single core?

douglasg14b commented 3 years ago

I would caution that parallelization may be unwanted in some applications, and can interfere with the performance of these applications. If parallelization is introduced to some of these functions, it really must be optional.

uber / h3

Parallelization of computing-intensive functions #191