nawendt / esmpy-tutorial

Basic tutorial for ESMPy Python package
MIT License

Use more efficient memory layout #4

Closed. JiaweiZhuang closed this issue 6 years ago.

JiaweiZhuang commented 6 years ago

First, thanks for this clear tutorial! It helped a lot when I first started writing my xESMF package.

In your current example, a normal numpy array (C-ordered) is passed to ESMPy, which expects Fortran-ordered data, so there is overhead from rearranging the memory layout. It is much better to pass the transpose of the array (a Fortran-ordered view of the C-ordered original) to ESMPy; this makes passing arrays roughly 10x faster. See ESMPy_memory_layout.ipynb for more explanation and timings.
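The effect can be reproduced with plain numpy. In the rough sketch below, the two Fortran-ordered arrays simply stand in for ESMPy's field data buffer, and the grid size is arbitrary:

```python
import numpy as np
from timeit import timeit

nlat, nlon = 2000, 3000
data = np.random.rand(nlat, nlon)                 # typical C-ordered field, shape [Nlat, Nlon]

# Stand-ins for ESMPy's Fortran-ordered field buffer:
dest_c_shape = np.empty((nlat, nlon), order='F')  # grid built with shape [Nlat, Nlon]
dest_f_shape = np.empty((nlon, nlat), order='F')  # grid built with shape [Nlon, Nlat]

def copy_c_ordered():
    # slow: copying a C-ordered array into a Fortran-ordered buffer is a strided copy
    dest_c_shape[...] = data

def copy_transposed():
    # fast: data.T is already a Fortran-ordered view of shape [Nlon, Nlat],
    # so the copy is contiguous on both sides
    dest_f_shape[...] = data.T

print('C-ordered copy:  %.3f s' % timeit(copy_c_ordered, number=20))
print('transposed copy: %.3f s' % timeit(copy_transposed, number=20))
```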

In my xESMF timing test, passing the transpose (Fortran-ordered) is 20x faster than passing a normal C-ordered array.

Using the transpose also allows periodic boundary conditions to work correctly: ESMPy's periodic boundary option requires a shape of [Nlon, Nlat], but a typical numpy array has the shape [Nlat, Nlon]. See ESMPy_periodic_fix_ordering.ipynb.
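A rough sketch of the grid construction with transposed coordinates (the coordinate arrays here are just placeholders; see the notebook for the worked example):

```python
import numpy as np
import ESMF

# Placeholder 1-degree global grid; lon2d/lat2d have the typical [Nlat, Nlon] shape.
lon2d, lat2d = np.meshgrid(np.arange(0, 360, 1.0), np.arange(-90, 91, 1.0))

grid = ESMF.Grid(np.array(lon2d.T.shape),           # [Nlon, Nlat]
                 staggerloc=ESMF.StaggerLoc.CENTER,
                 coord_sys=ESMF.CoordSys.SPH_DEG,
                 num_peri_dims=1)                    # first (longitude) dimension is periodic

grid.get_coords(0)[...] = lon2d.T                    # transposed views, so cheap to copy
grid.get_coords(1)[...] = lat2d.T

field = ESMF.Field(grid)
# field.data now has shape [Nlon, Nlat]; assign data.T to it rather than data.
```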

nawendt commented 6 years ago

In general, I like this suggestion. After looking at this for my small examples, I was able to get about a 1.4x speedup; the difference is most likely due to my small grid sizes compared to your tests. While this works well when you are dealing with the entire array at once, it did not hold up when I modified my MPI regrid example. There you end up working with subsets of the array on each process, and those subset arrays are neither C_CONTIGUOUS nor F_CONTIGUOUS. You can, of course, use numpy.asfortranarray to fix the memory order, but its overhead actually makes the timing worse than just passing the unaltered array subset.
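The contiguity issue is easy to see with plain numpy (the array size and bounds below are arbitrary):

```python
import numpy as np

full = np.random.rand(200, 300)     # full domain
sub = full[25:75, 40:120]           # per-process subset selected by the local bounds

print(sub.flags['C_CONTIGUOUS'], sub.flags['F_CONTIGUOUS'])  # False False
print(np.shares_memory(full, sub))                           # True: sub is only a view

fsub = np.asfortranarray(sub)       # forces a Fortran-ordered copy, hence the extra overhead
print(np.shares_memory(full, fsub)) # False: a new array was allocated
```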

If I can implement this (in the serial case) without sacrificing the average user's ability to understand the process, then I am all for pointing out ways to be more efficient. If it ends up reducing clarity, I'll just note the efficiency issue, link to your examples, and let the intrepid user make the necessary changes on their own. The MPI example looks like it would require more thought if the same benefits are to be realized.

Lastly, as a fan of xarray, I like the application to ESMF.

nawendt commented 6 years ago

I took another look at the MPI example and figured out what was going on. The array bounds for each process, when used to subset the data, only give you a view of the numpy array; numpy.asfortranarray then returns a copy, which is where the overhead came from. What I ended up doing was transposing the arrays as they are read into memory and transposing back after the ESMPy interpolation is done. This gave roughly the same speedup as I saw in the tutorial notebook.
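In outline, the pattern looks something like this (read_data and do_regrid are stand-ins for the actual I/O and ESMPy regridding steps in the script):

```python
import numpy as np

def read_data():
    # stand-in for the actual file read; returns a C-ordered [Nlat, Nlon] array
    return np.random.rand(181, 360)

def do_regrid(field_in):
    # stand-in for the ESMPy regrid call operating on the [Nlon, Nlat] layout
    return field_in.copy()

data_t = read_data().T         # transpose on read: Fortran-ordered view, shape [Nlon, Nlat]
result_t = do_regrid(data_t)   # all ESMPy work happens in the Fortran-friendly layout
result = result_t.T            # transpose back to [Nlat, Nlon] once interpolation is done
```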

I have opted to place a toggle in the notebook and in the MPI script to make it easy to see things done both ways, and it allowed the code to remain easy to understand. Appreciate the suggestion.