icesat2py / icepyx

Python tools for obtaining and working with ICESat-2 data
https://icepyx.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

improve netcdf multi-group reading and handle metadata QC issue #516

Closed JessicaS11 closed 1 month ago

JessicaS11 commented 4 months ago

Per this discussion, ATL11 v006, ATL14 v003, and ATL15 v003 have metadata issues that cause icepyx to fail when it tries to get the version number from the granule file. This PR adds a temporary fix that gets the most recent version number from CMR and uses that instead.
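
For readers following along, the fallback described above could look roughly like the sketch below. It is not the code in this PR: `read_file_version` and `fetch_latest_version` are hypothetical callables that stand in for reading the granule's metadata and for querying CMR.

```python
from typing import Callable, Optional


def resolve_version(
    read_file_version: Callable[[], Optional[str]],
    fetch_latest_version: Callable[[], str],
) -> str:
    """Return the granule's version, or the latest known version as a fallback.

    Both arguments are hypothetical callables (assumptions for illustration):
    one reads the version from the granule's metadata, the other asks a
    catalog such as CMR for the most recent version of the product.
    """
    try:
        version = read_file_version()
    except (KeyError, AttributeError, OSError):
        # Missing or malformed metadata is treated as "no version found".
        version = None
    if version:
        return version
    return fetch_latest_version()


def broken_metadata() -> str:
    # Simulates a granule whose version attribute is missing.
    raise KeyError("version attribute not found")


print(resolve_version(broken_metadata, lambda: "006"))  # falls back
print(resolve_version(lambda: "003", lambda: "006"))    # uses the file's value
```

Passing the two lookups in as callables keeps the sketch independent of any particular file reader or catalog client.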

It also adds handling for reading in multiple groups of a netCDF file.
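
As a rough illustration of reading several groups from one file (a sketch, not the implementation in this PR), the helper below opens each group separately with xarray's `open_dataset(..., group=...)` and returns the results keyed by group name. The function name `open_groups`, the `opener` parameter, and the file and group names in the demo are assumptions made for this example.

```python
from typing import Any, Callable, Dict, Iterable, Optional


def open_groups(
    path: str,
    groups: Iterable[str],
    opener: Optional[Callable[..., Any]] = None,
) -> Dict[str, Any]:
    """Open each requested group of one netCDF file, keyed by group name."""
    if opener is None:
        # Imported lazily so the helper can be exercised without xarray installed.
        import xarray as xr

        opener = xr.open_dataset
    return {group: opener(path, group=group) for group in groups}


def stub_opener(path: str, group: str) -> str:
    # Stand-in for xr.open_dataset so the demo runs without a real file.
    return f"<dataset from {path}:{group}>"


# File and group names here are illustrative placeholders.
datasets = open_groups("example.nc", ["delta_h", "dhdt_lag1"], opener=stub_opener)
for name, ds in datasets.items():
    print(name, ds)
```

With a real file, leaving `opener` unset uses `xr.open_dataset`; whether the per-group datasets can then be merged depends on whether their coordinates agree.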

github-actions[bot] commented 4 months ago

Binder :point_left: Launch a binder notebook on this branch for commit 864ebb2bc28e721f8a458e72b8d56695cd12526b

I will automatically update this comment whenever this PR is modified

JessicaS11 commented 4 months ago

@rwegener2 @betolink @andypbarrett @scottyhq @wsauthoff @tsutterley @jpswinski Before reporting to xarray, I'm hoping one of you can check that I'm not doing something silly. Per this toy example (also in the metadata branch that this PR is for, but easier to view when it's not a GitHub diff), when I read an ATL15 file using xarray (xr.open_dataset()) in the cloud versus locally, the resulting structure of the dataset is different. Namely, in the cloud the variables include the time dimension, and locally they do not. I initially noticed it via the icepyx.Read functionality (bottom half of the notebook), but reproduced it with xarray directly (top half of the notebook). I've tried controlling for versions and defaults, but not yet with a fine-tooth comb. I now suspect the number of x and y values differs because the granule was ordered with subsetting (I thought I set it to false, but...), but that shouldn't make a difference anyway. Have any of you seen this?
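
One way to pin down this kind of mismatch is to compare the dimension names of each variable in the two datasets. The sketch below does that on plain mappings of variable name to dimension tuple; the helper name `diff_dims` and the variable names in the demo are made up for illustration.

```python
from typing import Dict, Mapping, Tuple

Dims = Tuple[str, ...]


def diff_dims(
    a: Mapping[str, Dims], b: Mapping[str, Dims]
) -> Dict[str, Tuple[Dims, Dims]]:
    """Return variables present in both mappings whose dimension tuples differ."""
    shared = sorted(a.keys() & b.keys())
    return {name: (a[name], b[name]) for name in shared if a[name] != b[name]}


# With xarray, such a mapping could be built from a dataset `ds` as
#   {name: ds[name].dims for name in ds.data_vars}
# The two mappings below are hypothetical stand-ins for the cloud and local reads.
cloud = {"delta_h": ("time", "y", "x"), "ice_area": ("y", "x")}
local = {"delta_h": ("y", "x"), "ice_area": ("y", "x")}

print(diff_dims(cloud, local))
```

Only variables present in both mappings are compared, so variables that exist on one side only would need a separate check.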

tsutterley commented 4 months ago

hey @JessicaS11, is ATL15_GL_0318_01km_003_01_HEGOUT.nc a renamed version of the ATL15 granule? Or is it a subset file from the NSIDC API? I think NSIDC is still working on 3D raster subsetting capability.

JessicaS11 commented 4 months ago

> hey @JessicaS11, is ATL15_GL_0318_01km_003_01_HEGOUT.nc a renamed version of the ATL15 granule? Or is it a subset file from the NSIDC API? I think NSIDC is still working on 3D raster subsetting capability.

That's what came back when I ordered and downloaded from NSIDC:

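```python
import icepyx as ipx

# Query ATL15 over a small bounding box
region = ipx.Query(
    product="ATL15",
    spatial_extent=[-17.25, 80.7, -16.0, 81.0],  # minlon, minlat, maxlon, maxlat
    date_range=['2018-09-15', '2023-03-02'],
)

# Order and download the granules, requesting no subsetting
region.download_granules(
    path="./doc/source/example_notebooks/atl15",
    subset=False,
)
```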

> I think NSIDC is still working on 3D raster subsetting capability.

That's also what I thought... my naive (and potentially problematic) assumption was that even if the file were being subset spatially, it shouldn't affect the structure and thus how xarray reads in a specific group.

betolink commented 4 months ago

just to confirm, @JessicaS11 the local granule was downloaded using icepyx?

if it was downloaded using the on-prem subsetting service maybe there is a bug in EGI (or an undocumented behavior) that removed the time dimension even if we are not subsetting the granule.

JessicaS11 commented 4 months ago

> just to confirm, @JessicaS11 the local granule was downloaded using icepyx?

Yes - with subsetting applied. (So subsetting on rasters is available now, @tsutterley!)

Submitting to EGI without subsetting params returned ftp downloadUrls (which icepyx isn't set up to handle, since it looks for the order ID). To initiate the unprocessed granule download (via a url request in the browser), I had to set agent=NO and mode=sync, at which point I assume it's not going through EGI. Opening the full granule with xarray behaved just as on the cloud, with the variables correctly formatted with x, y and time dimensions.

> if it was downloaded using the on-prem subsetting service maybe there is a bug in EGI (or an undocumented behavior) that removed the time dimension even if we are not subsetting the granule.

@mikala-nsidc, @betolink suggested you might know what's going on (or know someone who does): somehow EGI seems to restructure the data when it's subset, turning the time dimension into bands (as interpreted by xarray).

mikala-nsidc commented 4 months ago

@JessicaS11 @betolink @tsutterley The on-prem subsetter handles ATL15 parameters that are three-dimensional arrays by "flattening" them into bands. This means, for instance, that if you select a spatial subset of an ATL15 file, the time dimension in dh/dt will be separated into 18 bands. Attaching a screenshot to illustrate.

[Screenshot 2024-03-07 at 10 25 34 AM]

Spatial subsetting will be handled differently once those services are available in the cloud. The code will be developed to better handle 3D arrays. The flattening method was our only option for the on-prem subsetter.
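
To make the flattening concrete, here is a toy sketch in plain Python of splitting a (time, y, x) cube into one 2D band per time step and stacking it back. It does not reproduce the subsetter's actual output: the `band_N` naming and the assumption that band N corresponds to the N-th time step are illustrative only and not confirmed in this thread.

```python
from typing import Dict, List

Grid = List[List[float]]


def flatten_to_bands(cube: List[Grid], prefix: str = "band") -> Dict[str, Grid]:
    """Split a (time, y, x) cube into one named 2D grid per time step."""
    return {f"{prefix}_{i + 1}": grid for i, grid in enumerate(cube)}


def restack(bands: Dict[str, Grid], prefix: str = "band") -> List[Grid]:
    """Rebuild the cube, ASSUMING band N corresponds to the N-th time step."""
    # Sort numerically so that band_10 comes after band_2.
    order = sorted(bands, key=lambda name: int(name[len(prefix) + 1:]))
    return [bands[name] for name in order]


# Tiny cube: 3 time steps, each a 2 x 2 grid.
cube = [[[float(t), float(t)], [float(t), float(t)]] for t in range(3)]
bands = flatten_to_bands(cube)

print(list(bands))
print(restack(bands) == cube)
```

Restacking is only meaningful if the band order really does follow the time axis, which would need to be verified against the full granule.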

JessicaS11 commented 4 months ago

Thanks, @mikala-nsidc! It's good to know that's intended behavior. Is there a verifiable mapping of bands to timestamps? I don't see anything in the metadata for a given band to indicate which timestamp it corresponds to.

JessicaS11 commented 3 months ago

@wsauthoff @mikala-nsidc Are either of you available to review this PR?

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 13.63636%, with 19 lines in your changes missing coverage. Please review.

Project coverage is 66.27%. Comparing base (cc02758) to head (d23d1a0).

| Files | Patch % | Lines |
|---|---|---|
| icepyx/core/read.py | 0.00% | 15 Missing :warning: |
| icepyx/core/is2ref.py | 20.00% | 4 Missing :warning: |

Additional details and impacted files

```diff
@@               Coverage Diff               @@
##           development     #516      +/-   ##
===============================================
- Coverage        66.59%   66.27%   -0.32%
===============================================
  Files               36       36
  Lines             3059     3072      +13
  Branches           534      537       +3
===============================================
- Hits              2037     2036       -1
- Misses             934      948      +14
  Partials            88       88
```
