icesat2py / icepyx

Python tools for obtaining and working with ICESat-2 data
https://icepyx.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

ipx.Read.load() not working with ATL09 beam profile names #397

Closed DAndrewA closed 1 year ago

DAndrewA commented 1 year ago

Summary

The ipx.Read.load() method seems to expect beam group names matching the regular expression gt[1-3]['r','l'] (i.e. gt1l, gt1r, etc.). ATL09 data is instead organized into profile groups matching the pattern profile_[1-3]/(high|low)_rate. This means that loading variables from ATL09 data files isn't currently possible.
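The mismatch can be reproduced directly with the regular expression used in read.py:

```python
import re

# Regex currently used in icepyx/core/read.py. (The quoted character
# class also matches the literal quote and comma characters, but it
# does match gt1l, gt2r, etc.)
gt_pattern = r"gt[1-3]['r','l']"

# An ATL06-style beam group path matches...
assert re.match(gt_pattern, "gt1l/land_ice_segments") is not None

# ...but an ATL09-style profile path does not, so .group() is called
# on None and raises AttributeError.
assert re.match(gt_pattern, "profile_1/high_rate") is None
```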

Replication

I'm trying to load ATL09 data for cloud analysis. I subsetted the downloaded granules, which can be replicated with the code below:

import numpy as np
import icepyx as ipx

short_name = 'ATL09'

# set up a bounding box spatial region
summitLoc = np.array([-38.45867106661449, 72.5813291830755]*2)
# tolerance of the bbox in degrees
#tol = np.array([0.1,0.1]*2)
tol = np.array([8,8]*2)  # get a box that's a little bit bigger...
tol = tol * [-1,-1,1,1]
bbox = summitLoc + tol
print(bbox)  # [-46.45867107  64.58132918 -30.45867107  80.58132918]

# RGTs
rgt = 749  # closest pass

# cycles
cycles = [10,11,12]  # should get 3 files per RGT in this instance

region = ipx.Query(short_name, bbox, cycles=cycles, tracks=rgt)

I was then able to use .order_granules() and .download_granules() to download the data. I accessed the files with the h5py module and, as far as I can tell, none of the data is erroneous or missing.

I wrote a simple script to test the icepyx loading functionality:

import os
import icepyx as ipx

# define where to find the files, etc.
datafolder = 'path/to/the/data/'
files_list = os.listdir(datafolder)
print(files_list) # ['processed_ATL09_20210211004659_07491001_005_01.h5', 'processed_ATL09_20210512202651_07491101_005_01.h5', 'processed_ATL09_20210811160643_07491201_005_01.h5']

pattern = "processed_ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5"
product = 'ATL09'

reader = ipx.Read(data_source=datafolder, product=product, filename_pattern=pattern)
# outputs: You have 3 files matching the filename pattern to be read in.

var_list = ['latitude','longitude']
reader.vars.append(var_list=var_list)
print(reader.vars.wanted)
# {'sc_orient': ['orbit_info/sc_orient'], 'atlas_sdp_gps_epoch': ['ancillary_data/atlas_sdp_gps_epoch'], 'cycle_number': ['orbit_info/cycle_number'], 'rgt': ['orbit_info/rgt'], 'data_start_utc': ['ancillary_data/data_start_utc'], 'data_end_utc': ['ancillary_data/data_end_utc'], 'latitude': ['profile_1/high_rate/latitude', 'profile_1/low_rate/latitude', 'profile_2/high_rate/latitude', 'profile_2/low_rate/latitude', 'profile_3/high_rate/latitude', 'profile_3/low_rate/latitude'], 'longitude': ['profile_1/high_rate/longitude', 'profile_1/low_rate/longitude', 'profile_2/high_rate/longitude', 'profile_2/low_rate/longitude', 'profile_3/high_rate/longitude', 'profile_3/low_rate/longitude']}

ds = reader.load()

This produces the following error:

AttributeError                            Traceback (most recent call last)
Cell In[9], line 1
----> 1 ds = reader.load()

File ~/.conda/envs/icesat_summit/lib/python3.10/site-packages/icepyx/core/read.py:545, in Read.load(self)
    538 # DevNote: I'd originally hoped to rely on intake-xarray in order to not have to iterate through the files myself,
    539 # by providing a generalized url/source in building the catalog.
    540 # However, this led to errors when I tried to combine two identical datasets because the single dimension was equal.
    541 # In these situations, xarray recommends manually controlling the merge/concat process yourself.
    542 # While unlikely to be a broad issue, I've heard of multiple matching timestamps causing issues for combining multiple IS2 datasets.
    543 for file in self._filelist:
    544     all_dss.append(
--> 545         self._build_single_file_dataset(file, groups_list)
    546     )  # wanted_groups, vgrp.keys()))
    548 if len(all_dss) == 1:
    549     return all_dss[0]

File ~/.conda/envs/icesat_summit/lib/python3.10/site-packages/icepyx/core/read.py:686, in Read._build_single_file_dataset(self, file, groups_list)
    684 wanted_groups_list = wanted_groups_list[1:]
    685 ds = self._read_single_grp(file, grp_path)
--> 686 is2ds, ds = Read._add_vars_to_ds(
    687     is2ds, ds, grp_path, wanted_groups_tiered, wanted_dict
    688 )
    690 # if there are any deeper nested variables, get those so they have actual coordinates and add them
...
--> 414     gt_str = re.match(r"gt[1-3]['r','l']", grp_path).group()
    415     spot = is2ref.gt2spot(gt_str, is2ds.sc_orient.values[0])
    416     # add a test for the new function (called here)!

AttributeError: 'NoneType' object has no attribute 'group'

I've identified that the error comes from the regular-expression matching. In the ATL09 files, the groups are named according to the pattern profile_[1-3]/(high|low)_rate rather than gt[1-3]['r','l']. I tried changing this in the function, so that it read:

# (line 413) in icepyx/core/read.py
            import re
            gt_str = re.match(r"profile_[1-3]", grp_path).group()
            '''ORIGINAL CODE, DO NOT DELETE
            gt_str = re.match(r"gt[1-3]['r','l']", grp_path).group()
            '''
            spot = is2ref.gt2spot(gt_str, is2ds.sc_orient.values[0])

However, this simply throws an error of the form


AssertionError                            Traceback (most recent call last)
Cell In[9], line 1
----> 1 ds = reader.load()

File ~/.conda/envs/icesat_summit/lib/python3.10/site-packages/icepyx/core/read.py:548, in Read.load(self)
    541 # DevNote: I'd originally hoped to rely on intake-xarray in order to not have to iterate through the files myself,
    542 # by providing a generalized url/source in building the catalog.
    543 # However, this led to errors when I tried to combine two identical datasets because the single dimension was equal.
    544 # In these situations, xarray recommends manually controlling the merge/concat process yourself.
    545 # While unlikely to be a broad issue, I've heard of multiple matching timestamps causing issues for combining multiple IS2 datasets.
    546 for file in self._filelist:
    547     all_dss.append(
--> 548         self._build_single_file_dataset(file, groups_list)
    549     )  # wanted_groups, vgrp.keys()))
    551 if len(all_dss) == 1:
    552     return all_dss[0]

File ~/.conda/envs/icesat_summit/lib/python3.10/site-packages/icepyx/core/read.py:689, in Read._build_single_file_dataset(self, file, groups_list)
    687 wanted_groups_list = wanted_groups_list[1:]
    688 ds = self._read_single_grp(file, grp_path)
--> 689 is2ds, ds = Read._add_vars_to_ds(
    690     is2ds, ds, grp_path, wanted_groups_tiered, wanted_dict
    691 )
    693 # if there are any deeper nested variables, get those so they have actual coordinates and add them
...
    276     ], "An invalid ground track was found"
    278     gr_num = np.uint8(gt[2])
    279     gr_lr = gt[3]

AssertionError: An invalid ground track was found
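For context on this AssertionError: judging from the traceback above, is2ref.gt2spot first asserts that the string it receives is a valid ground-track name before parsing the beam number and side, so "profile_1" fails immediately. A hypothetical sketch of that check (not the actual icepyx source; the real code uses np.uint8 for the beam number):

```python
def gt2spot_check(gt):
    """Hypothetical sketch of the validation inside is2ref.gt2spot."""
    # the real function asserts the input is one of the six gt names
    assert gt in [
        "gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"
    ], "An invalid ground track was found"
    gr_num = int(gt[2])  # beam-pair number, e.g. 1
    gr_lr = gt[3]        # left/right flag, 'l' or 'r'
    return gr_num, gr_lr

gt2spot_check("gt1l")         # passes
# gt2spot_check("profile_1")  # would raise AssertionError
```

This is why swapping only the regex is not enough: any non-gt string passed through to gt2spot() still trips the assertion.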
DAndrewA commented 1 year ago

I've made some further changes to the code in ipx.Read and have been able to successfully load an ATL09 file into an xarray Dataset.

The code I've changed within the file is:

# line 407, in function _add_vars_to_ds
        else:
            import re
            print(f'NEW DEBUG: ipx.READ: _add_vars_to_ds: {grp_path=}')
            '''ORIGINAL CODE, DO NOT DELETE
            gt_str = re.match(r"gt[1-3]['r','l']", grp_path).group()
            spot = is2ref.gt2spot(gt_str, is2ds.sc_orient.values[0])
            '''
            profile_str = re.match(r'profile_[1-3]', grp_path).group()
            #rate_str = re.match(r"*['high','low']_rate", grp_path).group()
            print(f'NEW DEBUG: ipx.READ: _add_vars_to_ds: {profile_str=}')
            #print(f'NEW DEBUG: ipx.READ: _add_vars_to_ds: {rate_str=}')
            gt_str=profile_str
            spot = int(profile_str[-1])
            # add a test for the new function (called here)!

            grp_spec_vars = [...

By completely bypassing the function is2ref.gt2spot() I've been able to get a database that extracts the values I'm looking for. The output database has the structure:

[image: screenshot of the output Dataset structure]

It still uses the gt and spot notation from the ATL06 data format, but the data itself seems to be valid.

My solution as-is would be incompatible with the ATL06 data (and possibly other data products).
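One way to generalize the group matching so that both naming schemes yield an identifier, rather than hard-coding one or the other (a hypothetical sketch, not the icepyx implementation):

```python
import re

def group_to_id(grp_path):
    """Hypothetical sketch: extract a beam/profile identifier from a group path."""
    # ATL06-style beam groups, e.g. 'gt1l/land_ice_segments'
    m = re.match(r"gt[1-3][lr]", grp_path)
    if m:
        return m.group()
    # ATL09-style profile groups, e.g. 'profile_1/high_rate'
    m = re.match(r"profile_[1-3]", grp_path)
    if m:
        return m.group()
    raise ValueError(f"unrecognized group path: {grp_path}")

print(group_to_id("gt1l/land_ice_segments"))  # gt1l
print(group_to_id("profile_1/high_rate"))     # profile_1
```

The spot/profile number could then be derived per product instead of always going through is2ref.gt2spot().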

JessicaS11 commented 1 year ago

Hello @DAndrewA! Thank you very much for this great bug report! We have also encountered this issue with ATL11, but I'm struggling to keep up with bug fixes and new development simultaneously, so implementing a fix has been slow. The first portions of a fix for this particular issue are currently in the atl11 branch; it is quite similar to the solution you shared (but tries to generalize to work for the spot, profile, and pt cases). I'll be continuing to work on this later this week and next, and am happy to ping you when we've got an updated release out. Would you consider reviewing a pull request with this fix in it? Thanks!

DAndrewA commented 1 year ago

Hi @JessicaS11 , thanks for getting back to me! I've never reviewed a pull request before but I'd be more than happy to give it a go!

JessicaS11 commented 1 year ago

I've never reviewed a pull request before but I'd be more than happy to give it a go!

@DAndrewA Thanks for this offer! If you're new to the process or would like some help getting oriented around the GitHub pull request (PR) interface/pulling the branch to test locally, I'd be happy to arrange a call to get you going. For some high level info on reviewing code and why it's so critical and helpful, The Turing Way has some great resources. You can also check out some of the already merged PRs in this repo. In the meantime, I will send you an invite (which you will have to accept) to the icesat2py organization that icepyx lives in so that you have the right permissions to review.