icesat2py / icepyx

Python tools for obtaining and working with ICESat-2 data
https://icepyx.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
215 stars 107 forks source link

Loading data groups `ipx.Read.load` can fail on certain files in edge cases #576

Open jrenrut opened 2 months ago

jrenrut commented 2 months ago

Summary

I've found an edge case where ipx.Read fails to load a certain file missing a requested variable group.

Description

Background

I used ipx.Query to find data for a limited spatial extend for the entire year of 2023. I then ordered the granules:

query.order_vars.append(['h_li', 'latitude', 'longitude'])
query.subsetparams(Coverage=query.order_vars.wanted)
query.order_granules()
query.download_granules('./data')

I isolate the specific file that's missing one of my desired variable group, 'land_icesegments'. For a good_ file:

h5 = h5py.File(r"data\processed_ATL06_20230108204519_02951802_006_02.h5")
gt1l = hf["gt1l"]
gt1l.keys()
>>> <KeysViewHDF5 ['land_ice_segments', 'residual_histogram', 'segment_quality']

For the misbehaving file:

h5 = h5py.File(r"data\processed_ATL06_20230111085249_03331806_006_02.h5")
gt1l = hf["gt1l"]
gt1l.keys()
>>> <KeysViewHDF5 ['residual_histogram', 'segment_quality']

Error

When I go to load my variables, I run into an OSError:

data_path = Path('./data').resolve()
data_files = tuple(data_path.glob('*2023*.h5'))
reader = ipx.Read([str(df) for df in data_files])
reader.vars.append(var_list=['h_li', "latitude", "longitude"])
ds = reader.load()
>>> OSError: [Errno group not found: land_ice_segments] 'land_ice_segments'

The error originates in Read.load() on this line:

all_dss.append(
    self._build_single_file_dataset(file, groups_list)
)  # wanted_groups, vgrp.keys()))

In Read._build_single_file_dataset() on this line:

ds = self._read_single_grp(file, grp_path=wanted_groups_list[0])

Finally the error is thrown here in Read._read_single_grp(), calling xarray.open_dataset):

return xr.open_dataset(
    file,
    group=grp_path,
    engine="h5netcdf",
    backend_kwargs={"phony_dims": "access"},
)

Proposed Solution

Place the call of Read._build_single_file_dataset() in a try/except:

try:
    all_dss.append(
        self._build_single_file_dataset(file, groups_list)
    )  # wanted_groups, vgrp.keys()))
except OSError:
    warnings.warn(
        f"{file} is excluded from this dataset because it"
        "did not contain all wanted variable groups.",
        stacklevel=2,
    )
JessicaS11 commented 2 months ago

Thanks for the great bug report! Out of curiosity - any idea why that granule doesn't contain the desired variable group? In cases like this we like to make sure that it's an edge case that should be handled within icepyx rather than an issue with the data itself.

jrenrut commented 2 months ago

This is my first time interacting with ICESat-2 data, so unfortunately I can't really speak to that but I was wondering about this also. Was there maybe no valid data in this granule, and does that result in the variable group being removed rather than left empty? Not sure.

mikala-nsidc commented 2 months ago

@jrenrut Could you share the bounding box that you used for your spatial query? I'll try to mimic the spatial and variable subsetting that you did to see why that group might have been missing from the processed file.

jrenrut commented 2 months ago

@mikala-nsidc Sure. I couldn't upload a .kml so I left it out originally, but it's easy enough to copy/paste:

<kml xmlns="http://www.opengis.net/kml/2.2"><Document>
<Placemark>

  <Polygon>
<outerBoundaryIs>
  <LinearRing><coordinates>-108.7448855153659,42.52137511207607
-108.95629660653259,42.478405856164756
-109.42207743882035,42.67607422948224
-109.69309543424377,42.9411752885619
-110.02073789421323,43.19898442578465
-110.00903619011682,43.3097777679274
-109.90707513545527,43.38036843185466
-109.66266574886284,43.491721566096665
-109.42482929506421,43.33300738018947
-109.0565174766684,42.93568510669404
-108.88773084723528,42.74920648517508
-108.7448855153659,42.52137511207607</coordinates></LinearRing></outerBoundaryIs></Polygon></Placemark></Document></kml>
jrenrut commented 2 months ago

and then:

shape_file = Path('data/wind.kml')
query = ipx.Query(
    product='ATL06',
    spatial_extent=str(shape_file),  ## icepyx doesn't like Path objects
    date_range=[start_date, end_date]
)
mikala-nsidc commented 2 months ago
Screenshot 2024-08-22 at 4 34 13 PM

So, there aren't data inside your polygon for that granule. But I'm not sure why you got a processed file back at all. I repeated your workflow in Earthdata Search and was able to replicate your results. I also got a file without any land ice segments in it. I would expect the subsetter processing to fail in this case, and give you an error that there are no data within the subset boundaries, and I'm not sure why it didn't fail. We'll have to investigate a bit on our end (NSIDC). I'll let @JessicaS11 weigh in about how/whether to handle a case like this within icepyx.

JessicaS11 commented 1 month ago

We'll have to investigate a bit on our end (NSIDC). I'll let @JessicaS11 weigh in about how/whether to handle a case like this within icepyx.

Thanks for looking into this @mikala-nsidc! Obviously it would be best to handle the cause of this error at the source (why the subsetter is returning any data). However, it would also be great to have some sort of error handling built in to keep read-in workflows from failing... follow the conversation on this aspect over in the PR (#578).