Further Speeding Up Routing Weight Calculations

From #16, @wrightky said:

So, looking at this code, I see that the main structure of the weight computation hasn't changed. We're still constructing small sub-arrays at each index, doing a few quick operations (max, cleaning, multiplications, summing), and saving the 9 resulting weights in a weight array. Surprised we hadn't tried this sooner given how much better it performs.

It looks like the key reason the runtime scales better in this model isn't anything about how this computation works, it's how many times we do it. Originally, we performed these operations locally once for every particle at every iteration. So, the runtime scaled as Np_tracer iterations_per_particle. Now, we perform this once for every cell, so it scales with domain size (L-2) (W-2). I bet if you checked the example cases you benchmarked, the ratio of these values should give roughly the speedup you observed.

One thing I wonder, though, is whether we could obtain even faster runtimes by modifying the structure of this computation itself, by switching from local array operations to global. Whatever is causing this function to take so long must be due to the fact that we're repeatedly constructing many sub-arrays in a loop and operating on them, instead of performing a few big global array operations (inside which we'd be taking advantage of all the fast numpy broadcasting stuff). In principal, if we broke up these operations (max, cleaning, multiplication, summing) into a loop over each of the D8 directions, with each being a global matrix operation, instead of a loop over each cell, we could reduce the amount of overhead repeatedly calling these functions. I don't know exactly how that scales, but it'd be the difference between time(np.max(small array)) (L-2) (W-2) and time(np.max(big array)) * 9. Does that scale better? Not sure.

So that is a potential route for further speeding up the routing weight calculation.

passaH2O / dorado

Further Speeding Up Routing Weight Calculations #22