Closed rwegener2 closed 8 months ago
:point_left: Launch a binder notebook on this branch for commit 11625ecb26d45ad0452bb195c8cd1182fa321185
I will automatically update this comment whenever this PR is modified
:point_left: Launch a binder notebook on this branch for commit 0f18d568de6b58738a4503ef831c390c90e132a9
:point_left: Launch a binder notebook on this branch for commit 2e84bbca1447d872c4b1699d6a14c5ea647b2110
:point_left: Launch a binder notebook on this branch for commit 1e0bc69d93f606360ef21525e95bcb71a12f25ec
:point_left: Launch a binder notebook on this branch for commit 150d763fd0338371afeb818cc733a621b3938cea
:point_left: Launch a binder notebook on this branch for commit ddd254034c7aa7a78191f606cd8ffa952aaa76d0
:point_left: Launch a binder notebook on this branch for commit 0837bd790787ee5072ab7f1fe8a3ea8d13282cb2
:point_left: Launch a binder notebook on this branch for commit c77de888dc8ef7ac2211e847e451956054c1ed2b
:point_left: Launch a binder notebook on this branch for commit 454e30b2ecfe33c4ea0a3d240481db59fd0852ce
:point_left: Launch a binder notebook on this branch for commit f84bb32ccb4614a7cad88b4e1ab172a6de1a2667
:point_left: Launch a binder notebook on this branch for commit ef7aa1ccd5068cc98bef47323ec3aa5db5027e7e
This comment is to start documenting timing for s3 reads. Below I compare wall time across 3 ways of initializing a Variables instance and listing the available variables.
Syntax | Description | Instantiation Time | .avail() Time |
---|---|---|---|
v = Variables(path=local_path) |
Initializing from a local file | 448 ms ± 24.4 ms | 215 ns ± 3.52 ns |
v = Variables(product='ATL09', version='006') |
Initializing with a user given product/version | 200 ms ± 24.1 ms | 220 ns ± 0.798 ns |
v = Variables(path=s3_path) |
Initializing with an s3 filepath | 6.24 s ± 203 ms | 219 ns ± 8.26 ns |
Notes: Timing done using the %%timeit
magic. Earthdata Login already established for all runs, so authentication time is not included in any of the times
Initializing a Variables object with an s3 path is ~14 times slower than a with a local file, but the .avail()
call is just as fast as local because product and version are known. I assume the most time intensive part of initializing a Variables class with an s3 url is reading the product and version from the s3 file.
This is not super hopeful news off the bat, but I still think we should merge this and work on variable reads. The alternate approach to reading the product from the file is to use the filepath, but that had it's own set of dev challenges and also restricts icepyx to only NSIDC s3 data. I'm curious to get a more holistic view of read times before trying to optimize any particular piece.
Alright @JessicaS11 this is ready for review! My only uncertainty is that Travis is hanging again 😕 Let me know what you think!
@rwegener2 This is super close. Can you confirm it all still looks good from your perspective after all the merge conflicts, rebasing, etc.? I think we just need to update the line in our variables example to note that you can create a Variables object from local file, s3 url, or product+version.
UPDATE: I randomly tried to read in a file while on this branch, and it forced me to authenticate even though I was reading a local file. Not sure if that's something we inadvertently introduced here, but figured it was the logical place to record it until one of us digs in!
Thanks for helping move this to the finish line @JessicaS11! I think this looks ready to merge.
I randomly tried to read in a file while on this branch, and it forced me to authenticate even though I was reading a local file.
TLDR; This was unintended and it is not desired. As a fix I added an if statement in the init of Variables that checks if the path is s3 before authenticating.
Good catch. So it seems like one of the design elements I was most proud of in the auth module is working against us right now. We setup auth.py
such that self.auth
could exist anywhere as None, thus, but the actual authentication would only happen when self.auth
was called. This makes authentication available anywhere but delays the process of authentication until it is actually needed.
By passing self.auth
into the self._product = is2ref.extract_product(self._path, auth=self.auth)
line of the Variables init, we are instructing the auth module to authenticate even though authentication isn't needed. This is a result of making it so the extract_product
function works on either a local or s3 file. If we take auth=self.auth
out then the s3 reads that need auth won't work.
The fix I am pushing adds an if statement in the Variables init to check if the path is an s3 path and creates a local variable that is either self.auth
or None
. I'm not totally sold that's the best option, but I wasn't in love with any of my other ideas either. I considered:
validate_product
into two functions - local and s3 versions. Call them separately in the variables initIt's possible one of these other ones is better, but my feeling right now is to wait on one of these bigger restructures to see if there are other methods that get created as the cloud features of icepyx develop. The needs of those other functions could help influence how we approach these situations.
Check out this pull request on
See visual diffs & provide feedback on Jupyter Notebooks.
Powered by ReviewNB
The fix I am pushing adds an if statement in the Variables init to check if the path is an s3 path and creates a local variable that is either self.auth or None.
This makes sense to me, in combination with
It's possible one of these other ones is better, but my feeling right now is to wait on one of these bigger restructures to see if there are other methods that get created as the cloud features of icepyx develop. The needs of those other functions could help influence how we approach these situations.
It's possible that some decorators could ultimately help us with this kind of issue - especially if the "check" consistently stays as "does this start with "s3"?".
I added a few notes in the vars example. It looks like this is ready to go!
Goal
This PR generalizes the Variables class to display available variables from an s3 path. If an s3 path is given the Variables class uses the product and version to display available variables; it does not try to walk through the file and build a the variables tree. If a user tries to request data from outside the NSIDC bucket a warning is raised indicating that available variable lists may not be accurate.
How it was done
So simply! I'm excited about how easy this was, because it shows how worth it the efforts were to 1) create an auth Mixin and 2) make Variables an independent class 🙂
The biggest changes were in
is2ref.py
, in which theextract_product
andextract_version
functions were generalized to read from s3. They now accept an optionalauth
argument.How it can be tested
As a template for anyone ever writing future tests, the extract_product and extract_version functions should also be tested, for example as: