hainegroup / oceanspy

A Python package to facilitate ocean model data analysis and visualization.
https://oceanspy.readthedocs.io
MIT License

Iss386 #387

Closed Mikejmnez closed 1 year ago

Mikejmnez commented 1 year ago

This PR addresses #386.

Improves the performance of oceanspy with LLC4320 by avoiding the computation of max/min range values from the dataset, as these are known a priori.
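A minimal sketch of the idea (the `KNOWN_RANGES` table and `get_range` helper are hypothetical, not oceanspy's actual API, and the LLC4320 index values are illustrative):

```python
# Hypothetical sketch: LLC4320 horizontal index ranges are fixed by the
# grid definition, so they can be looked up rather than computed from the
# (very large, dask-backed) dataset.
KNOWN_RANGES = {"X": (0, 4319), "Y": (0, 4319)}  # illustrative values

def get_range(ds, dim):
    """Return (min, max) of a dimension, skipping a full-array reduction
    when the range is known a priori."""
    if dim in KNOWN_RANGES:
        return KNOWN_RANGES[dim]
    # Fallback: this triggers computation over the coordinate values.
    return int(ds[dim].min()), int(ds[dim].max())
```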

Feel free to suggest something different, not approve, etc...

This PR also ensures that the horizontal chunk size of the dataset resulting from a cutout is set by the domain size. Previously this was an option, intended for testing, but we can always rechunk afterwards. This change eliminates an unused parameter from `llc_rearrange`.
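For example, any finer chunking can be applied after the cutout with xarray's standard `chunk` method (dimension names and sizes below are illustrative):

```python
# ds_cut is the dataset returned by the cutout, with one chunk per
# horizontal dimension; rechunk explicitly if smaller chunks are wanted.
ds_rechunked = ds_cut.chunk({"X": 360, "Y": 360})
```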

codecov[bot] commented 1 year ago

Codecov Report

Merging #387 (7ba13f0) into main (4eaa06e) will increase coverage by 0.05%. The diff coverage is 75.00%.

```diff
@@            Coverage Diff             @@
##             main     #387      +/-   ##
==========================================
+ Coverage   94.10%   94.15%   +0.05%     
==========================================
  Files           9        9              
  Lines        4189     4194       +5     
  Branches     1005     1004       -1     
==========================================
+ Hits         3942     3949       +7     
  Misses        154      154              
+ Partials       93       91       -2     
```
| Flag | Coverage Δ |
| --- | --- |
| unittests | 94.15% <75.00%> (+0.05%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files Changed | Coverage Δ |
| --- | --- |
| oceanspy/subsample.py | 92.62% <ø> (ø) |
| oceanspy/llc_rearrange.py | 84.77% <50.00%> (+0.25%) ⬆️ |
| oceanspy/_ospy_utils.py | 98.24% <100.00%> (+0.05%) ⬆️ |


MaceKuailv commented 1 year ago

Why is `chunks` removed?

Mikejmnez commented 1 year ago

> Why is `chunks` removed?

Reasons:

1. It is virtually impossible to know a priori the size of the resulting dataset after an arbitrary cutout, in particular the lengths of core dimensions (`X`, `Y`, etc.).
2. Because of 1), it is essentially impossible to know a priori whether a "proposed" chunking strategy (passed as an argument to `llc_rearrange`) will create equal, consistent chunks across the dataset. A consistent chunking strategy divides the length of a core dimension into equal-sized pieces, so that there is always the same number of chunks ("Nx"), each with the same number of grid points. Because center and corner points (`X` and `Xp1`) are offset by one, this requires finding the greatest common divisor of the two lengths (because we want large chunks), or close-to-equal chunk sizes; and since consecutive integers are coprime, no chunk size larger than one grid point divides both dimensions evenly.
3. Performing an arbitrary cutout can therefore create inconsistent chunking along core dimensions. There is no immediate error associated with this, but it can creep in later when making computations (not when plotting). This means the dataset resulting from a cutout should always be rechunked.
4. Inconsistent chunking can negatively affect the performance of subsequent operations and lead to errors; a minimal example is sketched below.

And so the only option guaranteed to produce consistent chunking after a cutout via `llc_rearrange` is a single chunk spanning each core dimension (e.g. `X`, `Xp1`, `Y`, `Yp1`, ...). Any rechunking can be done afterwards, after inspecting the dataset. Accepting `chunks` as an argument would very likely lead to errors.
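A small self-contained sketch of the offset problem (dimension lengths are illustrative): a chunk size that divides `X` evenly leaves a ragged last chunk on `Xp1`, while single full-dimension chunks are always consistent.

```python
import numpy as np
import xarray as xr

nx = 360  # illustrative length of the center dimension X
ds = xr.Dataset(
    {
        "at_center": ("X", np.zeros(nx)),
        "at_corner": ("Xp1", np.zeros(nx + 1)),  # corners offset by one
    }
)

# 90 divides X (360) evenly but not Xp1 (361):
chunked = ds.chunk({"X": 90, "Xp1": 90})
print(chunked["at_center"].chunks)  # ((90, 90, 90, 90),)
print(chunked["at_corner"].chunks)  # ((90, 90, 90, 90, 1),)  <- ragged

# A single chunk spanning each dimension is always consistent:
whole = ds.chunk({"X": -1, "Xp1": -1})
print(whole["at_corner"].chunks)    # ((361,),)
```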

Mikejmnez commented 1 year ago

> Looks good to me. Do these changes work OK with ECCO and daily_ecco LLC datasets too?

Yes. The changes are independent of the size of the LLC dataset.

ThomasHaine commented 1 year ago

@asiddi24 @MaceKuailv @malmans2 @renskegelderloos Does this look OK to you? If so, please approve the PR. Thanks!