hainegroup / oceanspy

A Python package to facilitate ocean model data analysis and visualization.
https://oceanspy.readthedocs.io
MIT License

_check_range adds a significant and unnecessary overhead to cutout with LLC grids #386

Closed · Mikejmnez closed this 1 year ago

Mikejmnez commented 1 year ago

The following evaluations at the beginning of cutout with LLC4320 could be circumvented:

YRange = _check_range(od, YRange_in, "YRange")
XRange = _check_range(od, XRange_in, "XRange")

where XRange_in and YRange_in are input arguments to cutout. These checks are necessary for most datasets, but they are unnecessary for datasets like ECCO, LLC4320, and even DYAMOND, whose horizontal range spans the entire globe.
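For context, the expensive part of _check_range is the eager min/max reduction over the full grid coordinate. A simplified sketch of what each call triggers (illustrative only, not oceanspy's actual implementation):

    import numpy as np

    def _check_range_sketch(od, user_range, range_name):
        # Simplified: pick the grid coordinate matching the requested range,
        # then force an eager min/max reduction over every chunk of it.
        coord = od._ds["YG" if "Y" in range_name else "XG"]
        mincheck = coord.min().values  # dask compute over the whole grid
        maxcheck = coord.max().values  # ditto
        # Clip the requested range to the grid's extent.
        return np.clip(np.asarray(user_range, dtype=float), mincheck, maxcheck)

On a global grid, these reductions only ever rediscover that XG spans [-180, 180] and YG spans [-90, 90].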

In particular, I found that the cost of these evaluations is sensitive to chunk size. For example, consider the following two chunking strategies:

chunk1 = {'Xp1': 4320, 'Yp1': 4320, 'face':1, 'Z':1}

chunk2 = {'Xp1': 720, 'Yp1': 720, 'face':1, 'Z':90}

I timed the evaluations with LLC4320 data on SciServer and got the following results:

chunk1:

CPU times: user 1.82 s, sys: 304 ms, total: 2.13 s
Wall time: 9.08 s

chunk2:

CPU times: user 17.8 s, sys: 2.28 s, total: 20.1 s
Wall time: 57.4 s
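
For reference, a minimal sketch of reproducing this cost in an IPython session on SciServer (the catalog name is illustrative; the reductions below are essentially what _check_range evaluates eagerly):

    import oceanspy as ospy

    od = ospy.open_oceandataset.from_catalog("LLC4320")
    ds = od._ds.chunk({'Xp1': 720, 'Yp1': 720, 'face': 1, 'Z': 90})  # chunk2

    %time (ds['YG'].min().values, ds['YG'].max().values)
    %time (ds['XG'].min().values, ds['XG'].max().values)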

Solution:

We can define a parameter within the intake catalog that contains the precomputed max and min values of XG and YG, and have these defined within od.parameters.
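For example, the catalog could populate something like the following (key names are hypothetical, not the current oceanspy API):

    # Hypothetical: precomputed horizontal extents injected from the intake
    # catalog at open time, so no grid reduction is ever needed.
    od.parameters["XG_range"] = (-180.0, 180.0)  # (min, max) of XG
    od.parameters["YG_range"] = (-90.0, 90.0)    # (min, max) of YG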

Then allow _check_range (which takes od as an argument) to check whether these are predefined, and if not, compute them. The relevant change would be somewhere around here:

        prefs = ["Y", "X", "Z", "time"]
        coords = ["YG", "XG", "Zp1", "time"]
        for pref, coord in zip(prefs, coords):
            if pref in objName:
                # key name below is illustrative; precomputed ranges would
                # be injected into od.parameters from the intake catalog
                if coord + "_range" in od.parameters:
                    valchek = od.parameters[coord + "_range"]
                else:
                    valchek = od._ds[coord]
                break

Later, the max/min values are computed as

        maxcheck = valchek.max().values
        mincheck = valchek.min().values
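
If valchek came from od.parameters, it may already be a plain (min, max) pair rather than a DataArray, so this step could branch accordingly (a sketch, assuming the catalog stores a tuple):

        if isinstance(valchek, tuple):
            # precomputed (min, max) from the catalog: no compute triggered
            mincheck, maxcheck = valchek
        else:
            # fall back to eager reductions over the grid coordinate
            maxcheck = valchek.max().values
            mincheck = valchek.min().values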

An alternative, when 'face' is a dimension of the dataset, is to check:

abs(XRange)<180 and abs(YRange)<90

But this conditional may not be appropriate for other datasets with 'face' as a dimension (e.g., ASTE).
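
As a sketch of that alternative (function name hypothetical), keeping the ASTE caveat in mind:

    import numpy as np

    def ranges_within_global_domain(XRange, YRange):
        # True if the requested ranges are trivially contained in a global
        # lon/lat domain, so the grid min/max check could be skipped.
        return bool(np.all(np.abs(XRange) < 180) and np.all(np.abs(YRange) < 90))

Gating this on 'face' being in od._ds.dims would work for ECCO and LLC4320, but, as noted, ASTE also has a 'face' dimension without covering the globe.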

Depending on chunk size, circumventing the maxcheck/mincheck calculation from the grid yields a speedup of up to one minute per cutout in LLC4320.

asiddi24 commented 1 year ago

Isn't that true for ECCO too, which has a 'face' dimension? I would assume it would not really make a difference for a dataset of that size, though. Also, with ECCO, the chunk size has not caused any issues with cutout so far, considering how it is chunked at the moment. I can see, though, that with LLC4320 this would create a huge speedup. This looks like a great catch to increase compute speeds!

Mikejmnez commented 1 year ago

Yes, the ECCO dataset shows very little performance sensitivity to this check. LLC4320 with chunk1 (above), the original chunking of the dataset accessible through SciServer, is also not very sensitive because the chunks are large. But chunk2 (above), which performs much better on the Ceph cluster (when allowing full depth per chunk), is very sensitive to this (unnecessary) evaluation, as expected, since there are more and smaller horizontal chunks.

Mikejmnez commented 1 year ago

Closed by PR #387.