Expand Variables class to read s3 urls

rwegener2 commented 9 months ago

Goal

This PR generalizes the Variables class to display available variables from an s3 path. If an s3 path is given the Variables class uses the product and version to display available variables; it does not try to walk through the file and build a the variables tree. If a user tries to request data from outside the NSIDC bucket a warning is raised indicating that available variable lists may not be accurate.

How it was done

So simply! I'm excited about how easy this was, because it shows how worth it the efforts were to 1) create an auth Mixin and 2) make Variables an independent class 🙂

The biggest changes were in is2ref.py, in which the extract_product and extract_version functions were generalized to read from s3. They now accept an optional auth argument.

How it can be tested

from icepyx.core.variables import Variables

s3_path = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL09/006/2019/11/30/ATL09_20191130215436_09930501_006_01.h5'
v = Variables(path=s3_path)
print(v.product)
print(v.version)
v.avail()

As a template for anyone ever writing future tests, the extract_product and extract_version functions should also be tested, for example as:

import earthaccess
from icepyx.core.is2ref import extract_product

auth = earthaccess.login()
s3_path = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL09/006/2019/11/30/ATL09_20191130215436_09930501_006_01.h5'
extract_product(s3_path, auth=auth)

github-actions[bot] commented 9 months ago

:point_left: Launch a binder notebook on this branch for commit 11625ecb26d45ad0452bb195c8cd1182fa321185

I will automatically update this comment whenever this PR is modified

:point_left: Launch a binder notebook on this branch for commit 0f18d568de6b58738a4503ef831c390c90e132a9

:point_left: Launch a binder notebook on this branch for commit 2e84bbca1447d872c4b1699d6a14c5ea647b2110

:point_left: Launch a binder notebook on this branch for commit 1e0bc69d93f606360ef21525e95bcb71a12f25ec

:point_left: Launch a binder notebook on this branch for commit 150d763fd0338371afeb818cc733a621b3938cea

:point_left: Launch a binder notebook on this branch for commit ddd254034c7aa7a78191f606cd8ffa952aaa76d0

:point_left: Launch a binder notebook on this branch for commit 0837bd790787ee5072ab7f1fe8a3ea8d13282cb2

:point_left: Launch a binder notebook on this branch for commit c77de888dc8ef7ac2211e847e451956054c1ed2b

:point_left: Launch a binder notebook on this branch for commit 454e30b2ecfe33c4ea0a3d240481db59fd0852ce

:point_left: Launch a binder notebook on this branch for commit f84bb32ccb4614a7cad88b4e1ab172a6de1a2667

:point_left: Launch a binder notebook on this branch for commit ef7aa1ccd5068cc98bef47323ec3aa5db5027e7e

rwegener2 commented 8 months ago

This comment is to start documenting timing for s3 reads. Below I compare wall time across 3 ways of initializing a Variables instance and listing the available variables.

Times

Syntax	Description	Instantiation Time	`.avail()` Time
`v = Variables(path=local_path)`	Initializing from a local file	448 ms ± 24.4 ms	215 ns ± 3.52 ns
`v = Variables(product='ATL09', version='006')`	Initializing with a user given product/version	200 ms ± 24.1 ms	220 ns ± 0.798 ns
`v = Variables(path=s3_path)`	Initializing with an s3 filepath	6.24 s ± 203 ms	219 ns ± 8.26 ns

Notes: Timing done using the %%timeit magic. Earthdata Login already established for all runs, so authentication time is not included in any of the times

Highlights

Initializing a Variables object with an s3 path is ~14 times slower than a with a local file, but the .avail() call is just as fast as local because product and version are known. I assume the most time intensive part of initializing a Variables class with an s3 url is reading the product and version from the s3 file.

I still think we should merge this PR

This is not super hopeful news off the bat, but I still think we should merge this and work on variable reads. The alternate approach to reading the product from the file is to use the filepath, but that had it's own set of dev challenges and also restricts icepyx to only NSIDC s3 data. I'm curious to get a more holistic view of read times before trying to optimize any particular piece.

rwegener2 commented 8 months ago

Alright @JessicaS11 this is ready for review! My only uncertainty is that Travis is hanging again 😕 Let me know what you think!

JessicaS11 commented 8 months ago

@rwegener2 This is super close. Can you confirm it all still looks good from your perspective after all the merge conflicts, rebasing, etc.? I think we just need to update the line in our variables example to note that you can create a Variables object from local file, s3 url, or product+version.

UPDATE: I randomly tried to read in a file while on this branch, and it forced me to authenticate even though I was reading a local file. Not sure if that's something we inadvertently introduced here, but figured it was the logical place to record it until one of us digs in!

rwegener2 commented 8 months ago

Thanks for helping move this to the finish line @JessicaS11! I think this looks ready to merge.

I randomly tried to read in a file while on this branch, and it forced me to authenticate even though I was reading a local file.

TLDR; This was unintended and it is not desired. As a fix I added an if statement in the init of Variables that checks if the path is s3 before authenticating.

Good catch. So it seems like one of the design elements I was most proud of in the auth module is working against us right now. We setup auth.py such that self.auth could exist anywhere as None, thus, but the actual authentication would only happen when self.auth was called. This makes authentication available anywhere but delays the process of authentication until it is actually needed. By passing self.auth into the self._product = is2ref.extract_product(self._path, auth=self.auth) line of the Variables init, we are instructing the auth module to authenticate even though authentication isn't needed. This is a result of making it so the extract_product function works on either a local or s3 file. If we take auth=self.auth out then the s3 reads that need auth won't work.

The fix I am pushing adds an if statement in the Variables init to check if the path is an s3 path and creates a local variable that is either self.auth or None. I'm not totally sold that's the best option, but I wasn't in love with any of my other ideas either. I considered:

split validate_product into two functions - local and s3 versions. Call them separately in the variables init
make validate product a class method of variables (and also read)
make validate product a class method of auth
make the auth class an attribute of the classes instead of integrating it directly as a mixin

It's possible one of these other ones is better, but my feeling right now is to wait on one of these bigger restructures to see if there are other methods that get created as the cloud features of icepyx develop. The needs of those other functions could help influence how we approach these situations.

review-notebook-app[bot] commented 8 months ago

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

JessicaS11 commented 8 months ago

The fix I am pushing adds an if statement in the Variables init to check if the path is an s3 path and creates a local variable that is either self.auth or None.

This makes sense to me, in combination with

It's possible one of these other ones is better, but my feeling right now is to wait on one of these bigger restructures to see if there are other methods that get created as the cloud features of icepyx develop. The needs of those other functions could help influence how we approach these situations.

It's possible that some decorators could ultimately help us with this kind of issue - especially if the "check" consistently stays as "does this start with "s3"?".

I added a few notes in the vars example. It looks like this is ready to go!

icesat2py / icepyx