Yizhaofeng opened this issue 3 months ago
There are plenty of articles discussing this issue. Could applying an FFT to the calculation be an optimal approach for coarse-graining? I have also considered using a sparse matrix to store the kernel values and reading from that sparse matrix directly at each computation, which would avoid recomputing the local kernel.
Hello :-)
Sorry, this response got far longer than I had first planned. The short version is: unfortunately there aren't really many great options for reducing the cost of computing the kernel, and all of the problems essentially come back to "working on a sphere makes everything harder". Internally, the kernel is only computed once per point in space, and is then applied to all times before moving on to the next point, so the most efficient path might be to merge your outputs into one file and process them as a whole, although I'm not sure if that's feasible for your set-up or not.
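In case it helps to picture that ordering, here's a bare-bones sketch (illustrative only: the uniform placeholder kernel and the variable names are mine for the example, not FlowSieve's actual code):

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Toy sizes standing in for (spatial points) x (times in one input file).
    const int Npts = 1000, Ntime = 24;
    std::vector<std::vector<double>> fine(Ntime, std::vector<double>(Npts, 1.0));
    std::vector<std::vector<double>> coarse(Ntime, std::vector<double>(Npts, 0.0));

    for (int ipt = 0; ipt < Npts; ++ipt) {
        // Expensive step: build the local kernel for this point ONCE.
        // (Placeholder: uniform weights; the real kernel depends on great-circle
        //  distances to the target point and on the filter scale.)
        std::vector<double> kernel(Npts, 1.0 / Npts);

        // Cheap step: apply that same kernel to EVERY time in the file before
        // moving on to the next spatial point. Packing more times into a single
        // input file therefore amortizes the kernel cost over more data.
        for (int t = 0; t < Ntime; ++t) {
            double sum = 0.0;
            for (int k = 0; k < Npts; ++k) sum += kernel[k] * fine[t][k];
            coarse[t][ipt] = sum;
        }
    }
    std::printf("coarse[0][0] = %g\n", coarse[0][0]);   // 1 for this constant toy field
    return 0;
}
```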
You're correct that the computational time increases (quadratically) as the filter scale increases. Unfortunately, applying FFTs doesn't really solve the problem, since you would either (i) need to be working in a Cartesian setting to use FFTs, or (ii) if you wanted to generalize to spherical harmonics, face an even more expensive calculation, since computing the coefficients of the harmonics requires a domain integral for each mode [efficient spherical harmonic techniques rely on having specific grids that are optimized for exactly that purpose].
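To make that second point concrete (standard definitions; I'm writing $N$ for the number of grid points and $L$ for the maximum harmonic degree): each coefficient is a full domain integral,

$$
\hat{f}_{\ell m} \;=\; \int_{\Omega} f(\theta, \phi)\, Y_{\ell m}^{*}(\theta, \phi)\, \mathrm{d}\Omega ,
$$

so evaluating all $(L+1)^2$ modes by direct quadrature over $N$ grid points costs on the order of $N L^2$ operations, unless the data sit on one of those purpose-built grids.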
So, what can we do to make things faster? In FlowSieve, we actually do re-use the kernel, especially if the longitude grid is uniform. If the longitude spacing is uniform, then we only need to compute the kernel once per latitude, and can re-use it at all longitudes (by simply re-indexing, at no cost!). That reduces the number of kernel calculations from N^2 to N, so while it doesn't change the quadratic scaling, it does substantially reduce the runtime.
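For what it's worth, here's a small self-contained sketch of why that re-indexing works (a toy grid with an assumed haversine distance; not the actual FlowSieve indexing): on a uniform, periodic longitude grid, the distance between the target point (i, j) and a source point (ii, jj) depends only on the two latitudes and the longitude offset jj - j, so one set of distances per target latitude is enough.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Great-circle distance [km] between two (lat, lon) points given in radians.
static double haversine_km(double lat1, double lon1, double lat2, double lon2) {
    const double R = 6371.0;
    const double a = std::pow(std::sin(0.5 * (lat2 - lat1)), 2)
                   + std::cos(lat1) * std::cos(lat2) * std::pow(std::sin(0.5 * (lon2 - lon1)), 2);
    return 2.0 * R * std::asin(std::sqrt(a));
}

int main() {
    const double pi = std::acos(-1.0);
    const int Nlat = 12, Nlon = 36;              // toy grid
    const double dlon = 2.0 * pi / Nlon;         // uniform longitude spacing: the key assumption
    std::vector<double> lat(Nlat);
    for (int i = 0; i < Nlat; ++i) lat[i] = (-75.0 + 150.0 * i / (Nlat - 1)) * pi / 180.0;

    const int i_target = 5;                      // one target latitude row

    // Compute the distances ONCE for this latitude, with the target at longitude index 0.
    std::vector<double> ref_dist(Nlat * Nlon);
    for (int ii = 0; ii < Nlat; ++ii)
        for (int jj = 0; jj < Nlon; ++jj)
            ref_dist[ii * Nlon + jj] = haversine_km(lat[i_target], 0.0, lat[ii], jj * dlon);

    // Re-use them at every other target longitude j by shifting the longitude index:
    // distance from (i_target, j) to (ii, jj) is ref_dist[ii*Nlon + ((jj - j) mod Nlon)].
    double max_err = 0.0;
    for (int j = 0; j < Nlon; ++j)
        for (int ii = 0; ii < Nlat; ++ii)
            for (int jj = 0; jj < Nlon; ++jj) {
                const double reused = ref_dist[ii * Nlon + (((jj - j) % Nlon + Nlon) % Nlon)];
                const double direct = haversine_km(lat[i_target], j * dlon, lat[ii], jj * dlon);
                max_err = std::max(max_err, std::fabs(reused - direct));
            }
    std::printf("max |reused - recomputed| = %.3e km\n", max_err);   // round-off only
    return 0;
}
```

The kernel weights inherit the same property, as long as the kernel depends only on distance (and the filter scale).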
Depending on what you mean by storing in a sparse matrix, that has other drawbacks, unfortunately. Storing the kernel itself isn't necessarily all that useful for more general cost savings (especially if you want to reuse things between filter scales). What would be useful to save, though, is the distance between points, especially when you consider that most of the runtime cost of computing the kernel is actually in computing distances.

The problem here, however, is memory. Even if stored sparsely, the cost of keeping all of the distances within an ell^2 area for each grid point adds up very quickly. For instance, on a 1/12-degree grid we have ~1e7 grid points. For a 150 km kernel, that's a minimum of 275 points per kernel (far more at the poles). At double precision, you would need ~20 GB to store the distances. And that grows quadratically with ell, so 300 km would need 80 GB, and so on. Worse, it grows quartically with resolution, so doubling the resolution causes a 16-fold increase in the storage requirement (4x the number of points, each having distances to 4x more points), meaning 1/24-degree at 150 km would need 320 GB of memory. While those aren't necessarily burdensome sizes for storing on disk, they are for reading into active memory, and lazy-reading [only reading as you need it] opens up some I/O and file-locking problems.
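For reference, a back-of-the-envelope check of those numbers (using the values quoted above as assumptions):

```cpp
#include <cstdio>

int main() {
    // Values quoted above (assumed representative, not exact).
    const double n_points         = 1.0e7;   // ~1/12-degree global grid
    const double pts_per_kernel   = 275.0;   // minimum points inside a 150 km kernel at this resolution
    const double bytes_per_double = 8.0;

    const double gb = n_points * pts_per_kernel * bytes_per_double / 1.0e9;
    std::printf("150 km @ 1/12 deg : ~%.0f GB\n", gb);          // ~22 GB (the ~20 GB above)
    std::printf("300 km @ 1/12 deg : ~%.0f GB\n", gb * 4.0);    // kernel area ~ ell^2, so 4x
    std::printf("150 km @ 1/24 deg : ~%.0f GB\n", gb * 16.0);   // 4x points, 4x neighbours each: 16x
    return 0;
}
```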
If you have other thoughts on ways to optimize the filtering routines, I would love to hear them! I've spent a good deal of time trying to wrangle the problem, but beyond reusing an existing kernel as much as possible within a given filter scale (to minimize the number of times that we actually need to compute the kernels), I haven't found many feasible paths.
There are some tricks that I'm looking to test involving unstructured grids and only recomputing halos, but I suspect that will still be less efficient than the 'reuse across all longitudes' trick, and it's going to require some tricky coding, haha.
Thanks a lot :)
I now see that the complexity is far beyond what I imagined. I've already parallelized across 6 nodes, 384 CPUs in all, for my project. My input file has 24 hours, 40 layers, and 1920*696 horizontal grid points. If I choose filter scales = {3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 55 59 63 67 71 75 79 83 87 91 95 99 103 107 111 115 120 125 130 135 140 145 150} km, it takes a whole day to finish this case. And this case is just for one day of data; I want to process more, though. For me, that's costly TAT.
About "we only actually need to compute the kernel once per latitude". At one scale, I know kernel would be identical per latitude if the lon gird is uniform. I don't know if I'm getting it wrong, even though longitude grid spacing is uniform, we re-compute kernel in first lon to final lon in compute_local_kernel.cpp in which I did not find the code handling UNIFORM_LON_GRID to avoid re-computing the distance and kernel, so we actually re-compute kernel right at each longitude right? I maybe wrong.
About other paths to optimize the filtering routines: I talked about this issue with my colleagues, some of whom are interested in deep learning. They said CUDA could boost the convolution with the help of a GPU. The problem is that the whole set of filtering routines would have to be rewritten in Python within the PyTorch framework. At one point, I seriously thought about rewriting it. Is it really possible? If it is, I'd like to give it a shot.
Thanks again! Helpful conversation 💯.
Hi, it's me again. I have a list of filter scales, ranging from 3 km to 150 km. As the filter scale increases, the area of the convolution calculation increases, and so does the calculation time. If I submit the input files (24 hours in one file, 31 files in total) again and again for calculation, the local kernel will be recomputed each time, resulting in an increase in calculation cost. Is there any advice on this issue? I would appreciate any help.