icesat2py / icepyx

Python tools for obtaining and working with ICESat-2 data
https://icepyx.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Variables as an independent class #451

Closed rwegener2 closed 8 months ago

rwegener2 commented 10 months ago

Goal

The ability to list the available variables and their h5 paths is a unique and powerful feature of icepyx. This PR aims to make it first-order, user-facing functionality rather than functionality that is only used internally. I think this will be a useful user feature, and it makes sense to make this change before adding the s3 url capability.

How it was done

We anticipate that the most common use case is reading variables from a file. Alternatively, all you need to produce an .avail() list of variables is a product and version; this information is used to query an API, which returns the variables.

When using Variables() the user is required to supply EITHER:

  1. A local filepath (variables are read directly from the file)
  2. The product (variables are retrieved from the API)

The version argument is optional. If it is not supplied, the latest version is assumed. If both a path and a product are given, an error is raised.
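
For illustration, here is a minimal sketch (not the actual icepyx implementation) of how that either/or check might be enforced; the attribute names are assumptions based on the description above.

class Variables:
    """Sketch of the input validation described above; names are illustrative."""

    def __init__(self, path=None, product=None, version=None):
        # exactly one of `path` or `product` must be supplied
        if path and product:
            raise TypeError("Please provide either a filepath or a product, not both.")
        if not path and not product:
            raise TypeError("Either a filepath or a product is required.")

        self.path = path
        self.product = product
        # `version` is optional; when omitted, the latest version is assumed
        self.version = version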

How it can be tested (updated)

from icepyx.core.variables import Variables

# initializing with a filepath
v = Variables('data/ATL06/processed_ATL06_20190226005526_09100205_006_02.h5')
v.avail()

# initializing with a product only
v = Variables(product='ATL03')
v.avail()
print(v.version)  # returns '006'

# initializing with a product and version
v = Variables(product='ATL03', version='004')
v.avail()
print(v.version)  # returns '004'

And testing that Read and Query still work:

from icepyx.core.read import Read
ds = Read('data/ATL06/processed_ATL06_20190226005526_09100205_006_02.h5')
ds.vars.append(beam_list=['gt1l'], var_list=['h_li', "latitude", "longitude"])
ds.load()
from icepyx.core.query import Query
from icepyx.core.read import Read

# Create Query object
short_name = 'ATL06'
spatial_extent = [-55, 68, -48, 71]
date_range = ['2019-02-20','2019-02-28']
region = Query(short_name, spatial_extent, date_range)

# order and download granules
region.order_vars.append(beam_list=['gt1l'], var_list=['h_li', "latitude", "longitude"])
region.order_granules(Coverage=region.order_vars.wanted)
path = './download'
region.download_granules(path)

# open the files to see they have the required variables included (ex. sc_orient, atlas_sdp_gps_epoch)
reader = Read(path + '/processed_ATL06_20190222010344_08490205_006_02.h5')
reader.vars.append(beam_list=['gt1l'], var_list=['h_li', "latitude", "longitude"])
reader.load()

Notes

Open questions (Now resolved)

  1. Do we think it is important to maintain backwards compatibility given that this is a class that has, up until now, been for internal use only? --> We will raise DeprecationErrors for using old arguments, but won't fully support the old functionality (sketched after this list).
  2. What are the most user-friendly input arguments? (See final decision described above)
  3. _One aspect of reading that has now changed is that a user is allowed to read a dataset without appending any variables. This previously caused an error (UnboundLocalError: local variable 'groups_list' referenced before assignment). Now the code loads a dataset object that is basically empty. Do we want this? --> We will raise a new, descriptive error if the user has not added any variables before reading (also sketched after this list)_
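
A minimal sketch, assuming hypothetical exception and helper names, of the two error behaviors decided in items 1 and 3 above (the actual implementation in this PR may differ):

class DeprecationError(Exception):
    """Hypothetical exception for arguments that are no longer supported."""

def check_deprecated_args(vartype=None):
    # (1) old arguments such as `vartype` (removed in this PR) raise a
    # descriptive error instead of being silently ignored
    if vartype is not None:
        raise DeprecationError(
            "'vartype' is no longer a valid argument; initialize Variables "
            "with a filepath or a product instead."
        )

def check_variables_before_load(wanted):
    # (3) loading data before any variables have been appended raises a
    # descriptive error rather than returning an essentially empty dataset
    if not wanted:
        raise AttributeError(
            "No variables have been added. Use vars.append(...) to select "
            "variables before calling load()."
        )
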
github-actions[bot] commented 10 months ago

Binder :point_left: Launch a binder notebook on this branch for commit e3566f8765610b1f239ad46eb951d326c1666e07

I will automatically update this comment whenever this PR is modified


rwegener2 commented 10 months ago

[Icepyx structures diagram] A visual guide I was using to think about the structures we have available to users

rwegener2 commented 9 months ago

Outcomes of a discussion with @JessicaS11:

Do we think it is important to maintain backwards compatibility given that this is a class that has, up until now, been for internal use only?

What are the most user-friendly input arguments?

We will have:

  1. a file (local or, in a future PR, s3). Check to see if there is an existing standard for what to call this, otherwise use filepath
  2. product (and optionally version) — Default to the latest version if none is given (check into the functionality for finding the latest version in the Query class; see the sketch below).
  3. We actually don't want a Query object as an input. self.variables is an attribute of a Query object, so implementing Query as an input arg would be circular.

We won't aggregate these into a single data_source argument. Instead, all arguments will be optional, but we will enforce that either a filepath or a product argument (but not both) is given.
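
As a reference point, one way the latest version could be looked up when only a product is given is to query NASA's CMR collection metadata. This is a hedged sketch, not necessarily how icepyx resolves it internally; the helper name and the string-sort assumption are mine.

import requests

def latest_version(product):
    """Hypothetical helper: return the newest version string for an ICESat-2 product."""
    response = requests.get(
        "https://cmr.earthdata.nasa.gov/search/collections.json",
        params={"short_name": product, "page_size": 100},
    )
    entries = response.json()["feed"]["entry"]
    # ICESat-2 version_id values are zero-padded strings ('005', '006', ...),
    # so a plain string max() returns the latest version
    return max(entry["version_id"] for entry in entries)

print(latest_version("ATL06"))  # e.g. '006' at the time of writing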

Re: The docs

The plan is to use the existing page on nested variables, but restructure it so it starts by talking generally about the Variables class, then transitions to showing how the class can also be used directly from a Query object.

rwegener2 commented 9 months ago

This PR is almost totally ready for review. I ran into one bug while running all the example notebooks, and I'm looking for some help with it.

In line 479 of variables.py there is an if/else statement:

    478 if self._vartype == "order":
    479     nec_varlist = [
    480         "sc_orient",
    481         "sc_orient_time",
   (...)
    488         "end_delta_time",
    489     ]
    490 elif self._vartype == "file":

In this new iteration of Variables there is no longer a vartype. Where the vartype used to be "order", Variables now just takes in the product and version and generates a variables list from that.

So my question is: what was the purpose of using different sets of nec_varlists based on the vartype? I'm trying to figure out if this can be modified to run without knowing a vartype. cc: @JessicaS11

Context

If it is helpful, I found this while running the "Building the default variables list" section of the Subsetting notebook. These lines:

region_a.order_vars.append(defaults=True)
pprint(region_a.order_vars.wanted)

Raised the error:

AttributeError                            Traceback (most recent call last)
Cell In[7], line 1
----> 1 region_a.order_vars.append(defaults=True)
      2 pprint(region_a.order_vars.wanted)

File ~/icepyx/icepyx/core/variables.py:478, in Variables.append(self, defaults, var_list, beam_list, keyword_list)
    475 self._check_valid_lists(vgrp, allpaths, var_list, beam_list, keyword_list)
    477 # add the mandatory variables to the data object
    478 if self._vartype == "order":
    479     nec_varlist = [
    480         "sc_orient",
    481         "sc_orient_time",
   (...)
    488         "end_delta_time",
    489     ]
    490 elif self._vartype == "file":

AttributeError: 'Variables' object has no attribute '_vartype'
review-notebook-app[bot] commented 9 months ago

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

JessicaS11 commented 9 months ago

So my question is: what was the purpose of using different sets of nec_varlists based on the vartype?

Fundamentally, the Variables instance from within query is of vartype "order", while the Variables instance from read is of vartype "file". For users using variable subsetting to order data, we wanted to make sure that they got back some minimal set of variables they'd need (or appreciate having) to work with the data (but wouldn't necessarily order on their own initially) without needing to place another data order. This was likely more important when the data was just coming out and there was no a priori knowledge for most lab groups on what variables/groups/paths they might want. From the read side, there are a few pieces of information we need to be able to index/manipulate/merge across granules, so this allowed us to mandate they're included.

(As an aside: The non-implemented interactions of a query.variables instantiation with a type "file" confuses this, and if memory serves were the start of making it possible for the user to create a query object by supplying an ICESat-2 granule (i.e. file), I think with the goal that they could then still use the variables module to explore the variables in the file. It's probably worth my trying to find some more background on this idea (EDIT: #34) and, I'm guessing, removing this non-implemented code from query to clean things up.)

I'm trying to figure out if this can be modified to run without knowing a vartype

I think the short answer is "yes". From the read end, we can pretty easily add our own list with an .append(vars_we_want) call before we start loading. For the query (so on-prem subsetting) usage, the user directly constructs their vars.wanted and supplies it via the coverage kwarg to query.order(coverage=[]) or query.download(coverage=[]), so we'd have to somehow make sure the defaults were included when a user does variable subsetting.
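
To illustrate the read-side option with the API already used in the test section above, here is a sketch that appends a minimal set of variables before loading; the exact required list is an assumption (sc_orient and atlas_sdp_gps_epoch are the examples mentioned earlier in this thread).

from icepyx.core.read import Read

# open a local granule (path taken from the test example above)
reader = Read('data/ATL06/processed_ATL06_20190226005526_09100205_006_02.h5')

# explicitly append the variables needed for indexing/merging across granules
# (assumed list), plus the science variables of interest
reader.vars.append(var_list=['sc_orient', 'atlas_sdp_gps_epoch'])
reader.vars.append(beam_list=['gt1l'], var_list=['h_li', 'latitude', 'longitude'])

reader.load()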

Brainstorming Options

I suppose one alternative would be to take the minimal list required for the read.load application and use that for all cases (thus removing a few of the vartype="order" defaults). I think I would prefer the per-variables-object model described above though because I suspect that will get us closer to generating (for instance) Andy's hackweek list of variables and ultimately leaves things more flexible.

rwegener2 commented 9 months ago

Alright, I've marked this as "Ready for Review", although it isn't 100% complete. I'm getting a bit turned around about how to remove the nec_varlist from Variables. I think I did it fine for the Read class, but I'm getting lost tracing the Coverage parameter for the Query class. Instead of surging ahead on Query, I thought it would be worth it to request review of the current state and then have a conversation to outline how to update the Query class.

rwegener2 commented 9 months ago

Okay, this is ready for review! I've been updating the PR description so it is up to date with test examples and descriptions. A few comments:

  1. I don't know the details of how the gridded products differ from the non-gridded products. That's a blind spot I have, so keep an eye out for assumptions I may have made.
  2. In the way I was testing the Read functionality (example in the PR description) I was downloading a product with Query then reading it with Read. It seems that Read required cycle_number and rgt, but Query didn't require those. That felt odd, because I couldn't read a file with icepyx that I downloaded with icepyx, so I added those two variables to the Query required var_list. If there is a reason that isn't a good idea let me know and I can take them out.
  3. I'm requesting a little extra focus on the Variables append method. I didn't understand 100% of the parts I was modifying, so an extra look at that section would be appreciated.

Looking forward to the review!

P.S. I would like everyone to know that I did a global search for "depreciate" before committing. Look how far I've come 😛