hainegroup / oceanspy

A Python package to facilitate ocean model data analysis and visualization.
https://oceanspy.readthedocs.io
MIT License

Allow serial evaluation of mooring_array when face is dimension #379

Closed Mikejmnez closed 8 months ago

Mikejmnez commented 1 year ago

Description of current behavior

Extracting a mooring with subsample.mooring_array first calls subsample.cutout, for arbitrary datasets. For ECCO or LLC4320, subsample.cutout is a necessary intermediate step that "removes" face as a dimension by "transforming" the surviving subdomain of the original dataset. By "transforming", I mean reducing the horizontal dimensions of the dataset from (face, Y, X) to (Y, X). Given the complex topology of the dataset (e.g., increasing the index along the local X axis within faces {7, 8, 9, 10, 11, 12} samples data that lies southward instead of eastward), the "transformation" requires a subset of the entire dataset (of arbitrary size) to be transposed and its "ordering" reversed (plus other label manipulations).
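To make "transposed and their ordering reversed" concrete, here is a toy sketch on nested lists. It is not OceanSpy's implementation: the helper names `to_rotated` and `from_rotated` are hypothetical, and the orientation convention (local index i running southward on a rotated face) is an assumption chosen to mirror the description above.

```python
def to_rotated(geo):
    """Store a (Y, X) tile the way a rotated face might: local i runs southward.

    Assumed convention for this sketch: rot[j][i] == geo[ny - 1 - i][j].
    """
    ny, nx = len(geo), len(geo[0])
    return [[geo[ny - 1 - i][j] for i in range(ny)] for j in range(nx)]


def from_rotated(rot):
    """Undo the rotation: transpose, then reverse the order along Y."""
    transposed = [list(column) for column in zip(*rot)]
    return transposed[::-1]


# A 2 x 3 tile in ordinary (Y, X) orientation; values encode 10*y + x.
geo = [[10 * y + x for x in range(3)] for y in range(2)]

rot = to_rotated(geo)        # how the same data would sit on a rotated face
restored = from_rotated(rot)  # back to (Y, X) orientation

print(rot)
print(restored == geo)
```

On real data this has to happen for every variable, level and time step that survives the cutout, which is where the cost comes from.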

The transformation (removal of face) is quite fast for the ECCO dataset but can be very computationally expensive for larger datasets (e.g. DYAMOND or LLC4320). If the final desired output is only a 1D-like array, as with subsample.mooring_array, then the current order of operations is wasteful even though it is all done lazily. The cost shows up as the creation of a large number of task graphs in the intermediate step of transforming the data, which in many cases can overwhelm the system (the kernel dies). Furthermore, the larger the extent of the array pathway in lat/lon space (i.e. the more faces it spans), the larger the dataset that first needs to be "transformed". Consider the example below:

"wasteful" example:

lon = [-157.58510459, -158.33354534, -157.58510459, -127.64747465,  -104.82003182,  -76.75350376,  -59.91358692,  -49.06119607, -49.06119607,  -40.82834783,   -2.28364929,   22.78911579, 49.73298273,   78.92217192,  102.52851624,  119.36843308, 120.11687383,  114.50356822,   90.55346427,   57.99629171, 30.30398402,  -17.22200351,  -41.17210746,  -53.14715943, -61.75422804,  -70.73551702,  -87.20121349,  -81.2136875, -84.20745049, -106.66067294, -131.35921764, -154.56088085, -157.92886421, -169.90391619, -167.28437357, -157.55464384, -157.55464384]

lat = [-49.92911347, -56.41804741, -66.9491002 , -65.4390917 ,  -66.9491002 , -64.48901164, -59.96493682, -59.77709812, -67.09518613, -73.04637258, -67.3847335 , -66.05394013, -63.50473081, -62.31193694, -61.26497563, -59.22272469, -52.52506278, -45.94075026, -41.34034696, -37.88205039, -37.58609529, -35.48152609, -43.54916399, -52.97806792,  -56.43483467, -57.85575306, -51.37185456, -39.92040865, -22.77515385,   1.29866914,  21.71572601,  20.66900717,-8.40068788, -25.50672615, -39.05399571, -42.45449235, -50.42784761]

Visually, these coordinates are:

complex_array

A cutout with these coordinates involves faces [0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12] and results in the following cutout plot:

[Figure: cutout plot]

From the figure, about half of the ~300 x 180 (50869) grid points need to be "transformed" (i.e. transposed and their ordering reversed), but only about 1000 points are retained by the final mooring array, i.e. len(od_moor.dataset.mooring.values), and only about half of those need to be "transformed". Including the vertical and time dimensions rapidly increases the amount of data that needs to be "transformed", while the amount of data retained grows much more slowly.
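A back-of-the-envelope version of that comparison, using the point counts quoted above. The number of vertical levels and time steps (`nz`, `nt`) are hypothetical values picked only for illustration, and the "about half" fraction is taken at face value.

```python
GRID_POINTS = 50869      # horizontal points in the cutout (from the text above)
MOORING_POINTS = 1000    # approximate points retained along the mooring
ROTATED_FRACTION = 0.5   # "about half" lie on faces that need transforming


def transformed_points(horizontal_points, nz=1, nt=1):
    """Rough count of values that must be transposed and reversed."""
    return int(horizontal_points * ROTATED_FRACTION) * nz * nt


nz, nt = 50, 12  # hypothetical number of vertical levels and time steps

via_cutout = transformed_points(GRID_POINTS, nz, nt)
mooring_only = transformed_points(MOORING_POINTS, nz, nt)

print(via_cutout, mooring_only, via_cutout / mooring_only)
```

With these assumptions the cutout route touches roughly fifty times more values than are needed for the mooring itself, and the absolute gap widens with every added level or time step.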

Proposed optional behavior

Include a serialized approach in subsample.mooring_array, so that a mooring array within each face is first calculated without any subsample.cutout call, and these arrays are then combined into a single array along the new dimension mooring. This serialization will be optional, but when face is a dimension of the dataset it should be the default.

In the example above, the figure below color-codes the data within each face along the mooring dimension:

color_coded

Different colors along the mooring dimension can represent the same face, sampled "later" in the iterative (looped) process. This approach can then be delayed (parallelized).
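A minimal sketch of the proposed order of operations, assuming the path has already been mapped to (face, j, i) indices. It is a toy on plain Python containers, not the change that eventually closed this issue; `split_by_face`, `extract_segment` and `serial_mooring` are hypothetical names. Consecutive points on the same face form one segment, so a face that the path re-enters later shows up as a separate segment, as in the color-coded figure.

```python
from itertools import groupby


def split_by_face(path):
    """Split an ordered path of (face, j, i) indices into consecutive same-face runs."""
    return [
        (face, [(j, i) for _, j, i in run])
        for face, run in groupby(path, key=lambda point: point[0])
    ]


def extract_segment(data, face, indices):
    """Sample one face directly: no cutout and no face-removal step."""
    return [data[face][j][i] for j, i in indices]


def serial_mooring(data, path):
    """Loop over the segments and concatenate them along the mooring dimension."""
    mooring = []
    for face, indices in split_by_face(path):
        mooring.extend(extract_segment(data, face, indices))
    return mooring


# Toy dataset indexed as data[face][j][i]; values encode 100*face + 10*j + i.
data = {
    face: [[100 * face + 10 * j + i for i in range(3)] for j in range(3)]
    for face in (1, 2)
}

# The path leaves face 1, crosses face 2, then returns to face 1.
path = [(1, 0, 0), (1, 0, 1), (2, 1, 1), (2, 2, 1), (1, 2, 2)]

segments = split_by_face(path)
mooring = serial_mooring(data, path)

print([face for face, _ in segments])
print(mooring)
```

Each call to `extract_segment` depends only on its own face and indices, which is what would make the loop a candidate for delayed or parallel evaluation.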

This proposed approach also prevents NaN-ed data from appearing in the coordinate variables, a behavior that began to produce failing tests with the newer version of scipy, as explained in #378.

Mikejmnez commented 8 months ago

closed by #399