icesat2py / icepyx

Python tools for obtaining and working with ICESat-2 data
https://icepyx.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Variables as an independent class #451

Closed rwegener2 closed 8 months ago

rwegener2 commented 10 months ago

Goal

The ability to list the available variables and their h5 paths is a unique and powerful feature of icepyx. This PR aims to make it first-order, user-facing functionality rather than functionality that is only used internally. I think this will be a useful user feature, and it makes sense to make this change before adding the s3 url capability.

How it was done

We anticipate that the most common use case is reading variables from a file. Alternatively, all you need to produce an .avail() list of variables is a product and version; this information is used to query an API, which returns the variables.

When using Variables() the user is required to supply EITHER:

  1. A local filepath (variables are read directly from the file)
  2. The product (variables are retrieved from the API)

The version argument is optional. If it is not supplied, the latest version is assumed. If both a path and a product are given, an error is raised.
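
For illustration, here is a minimal sketch (not the actual icepyx implementation) of how that either/or check might be enforced; the attribute names are assumptions based on the description above.

class Variables:
    """Sketch of the input validation described above; names are illustrative."""

    def __init__(self, path=None, product=None, version=None):
        # exactly one of `path` or `product` must be supplied
        if path and product:
            raise TypeError("Please provide either a filepath or a product, not both.")
        if not path and not product:
            raise TypeError("Either a filepath or a product is required.")

        self.path = path
        self.product = product
        # `version` is optional; when omitted, the latest version is assumed
        self.version = version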

How it can be tested (updated)

from icepyx.core.variables import Variables

# initializing with a filepath
v = Variables('data/ATL06/processed_ATL06_20190226005526_09100205_006_02.h5')
v.avail()

# initializing with a product only
v = Variables(product='ATL03')
v.avail()
print(v.version)  # returns '006'

# initializing with a product and version
v = Variables(product='ATL03', version='004')
v.avail()
print(v.version)  # returns '004'

And testing that Read and Query still work:

from icepyx.core.read import Read
ds = Read('data/ATL06/processed_ATL06_20190226005526_09100205_006_02.h5')
ds.vars.append(beam_list=['gt1l'], var_list=['h_li', "latitude", "longitude"])
ds.load()
from icepyx.core.query import Query
from icepyx.core.read import Read

# Create Query object
short_name = 'ATL06'
spatial_extent = [-55, 68, -48, 71]
date_range = ['2019-02-20','2019-02-28']
region = Query(short_name, spatial_extent, date_range)

# order and download granules
region.order_vars.append(beam_list=['gt1l'], var_list=['h_li', "latitude", "longitude"])
region.order_granules(Coverage=region.order_vars.wanted)
path = './download'
region.download_granules(path)

# open the files to see they have the required variables included (ex. sc_orient, atlas_sdp_gps_epoch)
reader = Read(path + '/processed_ATL06_20190222010344_08490205_006_02.h5')
reader.vars.append(beam_list=['gt1l'], var_list=['h_li', "latitude", "longitude"])
reader.load()

Notes

Open questions (Now resolved)

  1. Do we think it is important to maintain backwards compatibility given that this is a class that has, up until now, been for internal use only? --> We will raise DeprecationErrors for using old arguments, but won't fully support the old functionality (sketched after this list).
  2. What are the most user-friendly input arguments? (See final decision described above)
  3. _One aspect of reading that has now changed is that a user is allowed to read a dataset without appending any variables. This previously caused an error (UnboundLocalError: local variable 'groups_list' referenced before assignment). Now the code loads a dataset object that is basically empty. Do we want this? --> We will raise a new, descriptive error if the user has not added any variables before reading (also sketched after this list)_
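
A minimal sketch, assuming hypothetical exception and helper names, of the two error behaviors decided in items 1 and 3 above (the actual implementation in this PR may differ):

class DeprecationError(Exception):
    """Hypothetical exception for arguments that are no longer supported."""

def check_deprecated_args(vartype=None):
    # (1) old arguments such as `vartype` (removed in this PR) raise a
    # descriptive error instead of being silently ignored
    if vartype is not None:
        raise DeprecationError(
            "'vartype' is no longer a valid argument; initialize Variables "
            "with a filepath or a product instead."
        )

def check_variables_before_load(wanted):
    # (3) loading data before any variables have been appended raises a
    # descriptive error rather than returning an essentially empty dataset
    if not wanted:
        raise AttributeError(
            "No variables have been added. Use vars.append(...) to select "
            "variables before calling load()."
        )
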
github-actions[bot] commented 10 months ago

Binder :point_left: Launch a binder notebook on this branch for commit e3566f8765610b1f239ad46eb951d326c1666e07

I will automatically update this comment whenever this PR is modified


rwegener2 commented 10 months ago

[Icepyx structures diagram] A visual guide I was using to think about the structures we have available to users

rwegener2 commented 9 months ago

Outcomes of a discussion with @JessicaS11:

Do we think it is important to maintain backwards compatibility given that this is a class that has, up until now, been for internal use only?

What are the most user-friendly input arguments?

We will have:

  1. a file (local or, in a future PR, s3). Check to see if there is an existing standard for what to call this, otherwise use filepath
  2. product (and optionally version) — Default to the latest version if none is given (check into the functionality for finding the latest version in the Query class; see the sketch below).
  3. We actually don't want a Query object as an input. self.variables is an attribute of a Query object, so implementing Query as an input arg would be circular.

We won't aggregate these into a single data_source argument. Instead, all arguments will be optional, but we will enforce that either a filepath or a product argument (but not both) is given.
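
As a reference point, one way the latest version could be looked up when only a product is given is to query NASA's CMR collection metadata. This is a hedged sketch, not necessarily how icepyx resolves it internally; the helper name and the string-sort assumption are mine.

import requests

def latest_version(product):
    """Hypothetical helper: return the newest version string for an ICESat-2 product."""
    response = requests.get(
        "https://cmr.earthdata.nasa.gov/search/collections.json",
        params={"short_name": product, "page_size": 100},
    )
    entries = response.json()["feed"]["entry"]
    # ICESat-2 version_id values are zero-padded strings ('005', '006', ...),
    # so a plain string max() returns the latest version
    return max(entry["version_id"] for entry in entries)

print(latest_version("ATL06"))  # e.g. '006' at the time of writing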

Re: The docs

The plan is to use the existing page on nested variables, but restructure it so it starts by talking generally about the Variables class, then transitions to showing how the class can also be used directly from a Query object.

rwegener2 commented 9 months ago

This PR is almost totally ready for review. I ran into one bug while running all the example notebooks, and I'm looking for some help with it.

In line 479 of variables.py there is an if/else statement:

    478 if self._vartype == "order":
    479     nec_varlist = [
    480         "sc_orient",
    481         "sc_orient_time",
   (...)
    488         "end_delta_time",
    489     ]
    490 elif self._vartype == "file":

In this new iteration of Variables there is no longer a vartype. Where the vartype used to be "order", Variables now just takes in the product and version and generates a variables list from that.

So my question is: what was the purpose of using different sets of nec_varlists based on the vartype? I'm trying to figure out if this can be modified to run without knowing a vartype. cc: @JessicaS11

Context

If it is helpful, I found this while running the "Building the default variables list" section of the Subsetting notebook. These lines:

region_a.order_vars.append(defaults=True)
pprint(region_a.order_vars.wanted)

Raised the error:

AttributeError                            Traceback (most recent call last)
Cell In[7], line 1
----> 1 region_a.order_vars.append(defaults=True)
      2 pprint(region_a.order_vars.wanted)

File ~/icepyx/icepyx/core/variables.py:478, in Variables.append(self, defaults, var_list, beam_list, keyword_list)
    475 self._check_valid_lists(vgrp, allpaths, var_list, beam_list, keyword_list)
    477 # add the mandatory variables to the data object
    478 if self._vartype == "order":
    479     nec_varlist = [
    480         "sc_orient",
    481         "sc_orient_time",
   (...)
    488         "end_delta_time",
    489     ]
    490 elif self._vartype == "file":

AttributeError: 'Variables' object has no attribute '_vartype'
review-notebook-app[bot] commented 9 months ago

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

JessicaS11 commented 9 months ago

So my question is: what was the purpose of using different sets of nec_varlists based on the vartype?

Fundamentally, the Variables instance from within query is of vartype "order", while the Variables instance from read is of vartype "file". For users using variable subsetting to order data, we wanted to make sure that they got back some minimal set of variables they'd need (or appreciate having) to work with the data (but wouldn't necessarily order on their own initially) without needing to place another data order. This was likely more important when the data was just coming out and there was no a priori knowledge for most lab groups on what variables/groups/paths they might want. From the read side, there are a few pieces of information we need to be able to index/manipulate/merge across granules, so this allowed us to mandate they're included.

(As an aside: The non-implemented interactions of a query.variables instantiation with a type "file" confuses this, and if memory serves were the start of making it possible for the user to create a query object by supplying an ICESat-2 granule (i.e. file), I think with the goal that they could then still use the variables module to explore the variables in the file. It's probably worth my trying to find some more background on this idea (EDIT: #34) and, I'm guessing, removing this non-implemented code from query to clean things up.)

I'm trying to figure out if this can be modified to run without knowing a vartype

I think the short answer is "yes". From the read end, we can pretty easily add our own list with an .append(vars_we_want) call before we start loading. For the query (so on-prem subsetting) usage, the user directly constructs their vars.wanted and supplies it via the coverage kwarg to query.order(coverage=[]) or query.download(coverage=[]), so we'd have to somehow make sure the defaults were included when a user does variable subsetting.
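
To illustrate the read-side option with the API already used in the test section above, here is a sketch that appends a minimal set of variables before loading; the exact required list is an assumption (sc_orient and atlas_sdp_gps_epoch are the examples mentioned earlier in this thread).

from icepyx.core.read import Read

# open a local granule (path taken from the test example above)
reader = Read('data/ATL06/processed_ATL06_20190226005526_09100205_006_02.h5')

# explicitly append the variables needed for indexing/merging across granules
# (assumed list), plus the science variables of interest
reader.vars.append(var_list=['sc_orient', 'atlas_sdp_gps_epoch'])
reader.vars.append(beam_list=['gt1l'], var_list=['h_li', 'latitude', 'longitude'])

reader.load()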

Brainstorming Options

I suppose one alternative would be to take the minimal list required for the read.load application and use that for all cases (thus removing a few of the vartype="order" defaults). I think I would prefer the per-variables-object model described above though because I suspect that will get us closer to generating (for instance) Andy's hackweek list of variables and ultimately leaves things more flexible.

rwegener2 commented 9 months ago

Alright, I've marked this as "Ready for Review", although it isn't 100% complete. I'm getting a bit turned around about how to remove the nec_varlist from Variables. I think I did it fine for the Read class, but I'm getting lost tracing the Coverage parameter for the Query class. Instead of surging ahead on Query, I thought it would be worth it to request review of the current state and then have a conversation to outline how to update the Query class.

rwegener2 commented 9 months ago

Okay, this is ready for review! I've been updating the PR description so it is up to date with test examples and descriptions. A few comments:

  1. I don't know the details of how the gridded products differ from the non-gridded products. That's a blind spot I have, so keep an eye out for assumptions I may have made.
  2. In the way I was testing the Read functionality (example in the PR description) I was downloading a product with Query then reading it with Read. It seems that Read required cycle_number and rgt, but Query didn't require those. That felt odd, because I couldn't read a file with icepyx that I downloaded with icepyx, so I added those two variables to the Query required var_list. If there is a reason that isn't a good idea let me know and I can take them out.
  3. I'm requesting a little extra focus on the Variables append method. I didn't understand 100% of the parts I was modifying, so an extra look at that section would be appreciated.

Looking forward to the review!

P.S. I would like everyone to know that I did a global search for "depreciate" before committing. Look how far I've come 😛