icesat2py / icepyx

Python tools for obtaining and working with ICESat-2 data
https://icepyx.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
210 stars, 107 forks

Expand icepyx to read s3 data #468

Closed — rwegener2 closed this 9 months ago

rwegener2 commented 11 months ago

Goal

NSIDC has put IS-2 data in the cloud, so we would like icepyx to be able to access that data for users. ☁️💙

How it was done

The EarthdataAuth class was added as a mixin to the Read class. The product and version are read from the file in the same way they are for a local file. The Python standard glob module does not accept s3 paths, so s3 paths are not glob-able (i.e. the user must always give the full, exact filepath). An s3-aware implementation of glob exists (e.g. in s3fs), so this could be added in the future. Mixed lists of s3 paths and local paths are not allowed. If a user tries to load more than two files or more than three data variables at once, they get a warning that discourages continuing and requires explicit confirmation to proceed.

Very loose timing suggests it takes about 5 minutes per data variable per file to load s3 data (tested on the dem_h variable of an ATL09 file). Running two files took 11 minutes.

How to test it

from icepyx.core.read import Read

s3_path = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL09/006/2020/06/18/ATL09_20200618112352_12820701_006_01.h5'
reader = Read(s3_path)
reader.vars.append(var_list=['dem_h'])
reader.load()
github-actions[bot] commented 11 months ago

Binder :point_left: Launch a binder notebook on this branch for commit 7801b333f4bc1307abf07c6a2e38cec76657d8d1

I will automatically update this comment whenever this PR is modified


rwegener2 commented 11 months ago
Results of the first timing run. I would like to do this testing a bit more systematically, but as a first look from opening one variable (dem_h) of an ATL09 file (because I couldn't resist trying it out) we have:

Description | creating the reader (No. 1) | loading some data (No. 2)
----------- | --------------------------- | ------------------------
Local File  | 429 ms ± 6.8 ms             | 11.1 s ± 43.2 ms
s3 File     | 3.18 s ± 58 ms              | 5 min 58 s

Reference Code

from icepyx.core.read import Read

local_path = '/home/jovyan/data/ATL09/processed_ATL09_20200618112352_12820701_006_01.h5'
s3_path = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL09/006/2020/06/18/ATL09_20200618112352_12820701_006_01.h5'

# `path` changes to either `local_path` or `s3_path`
reader = Read(local_path)  # This step was timed (No. 1)
reader.vars.append(var_list=['dem_h'])
reader.load()  # This step was timed (No. 2)
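The "ms ± ms" figures above read like IPython %timeit output; outside a notebook, the same kind of measurement could be sketched with the stdlib timeit module (the timed function here is a placeholder for the Read construction):

```python
import timeit

# Placeholder for the step being timed; in a real run this would be
# `Read(local_path)` or `Read(s3_path)` (step No. 1 above).
def make_reader():
    return None

# repeat=5 gives several independent runs, in the spirit of %timeit's
# mean-and-spread reporting; min() is the least noise-contaminated run.
runs = timeit.repeat(make_reader, number=1, repeat=5)
print(f"creating the reader: best of 5 runs = {min(runs):.4f} s")
```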
rwegener2 commented 10 months ago

Following up on the todo

consider how we want to handle [it] if a list of s3 files or mixed s3/local paths are given as data_source

Allowing users to mix s3 and local paths sounds technically complex for what I imagine is a marginal use case, so I suggest we explicitly error if that happens. The options for what we could allow I see as:

  1. Allow users to give lists of files so long as they are all s3 paths
  2. Only allow users to give a single file if that file is on s3

I'm leaning towards the second one, just because it's barely practical for a user to open a variable from one file, let alone multiple. I could be swayed either way though. I'd love to hear additional opinions here.

JessicaS11 commented 10 months ago

Following up on the todo

consider how we want to handle [it] if a list of s3 files or mixed s3/local paths are given as data_source

Allowing users to mix s3 and local paths sounds technically complex for what I imagine is a marginal use case, so I suggest we explicitly error if that happens.

Agreed. The error could suggest creating multiple read objects and merging the datasets after they're loaded in that use case.
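The suggested workaround might look roughly like this, assuming (as in icepyx's read module) that each load produces an xarray Dataset; the datasets below are in-memory stand-ins for what two separate `Read(...).load()` calls would return, and the variable/coordinate names are illustrative only:

```python
import numpy as np
import xarray as xr

# Stand-ins for the outputs of two separate Read objects, one per
# source type (one local, one s3), loaded independently.
ds_local = xr.Dataset(
    {"dem_h": ("delta_time", np.arange(3.0))},
    coords={"delta_time": [0, 1, 2]},
)
ds_s3 = xr.Dataset(
    {"dem_h": ("delta_time", np.arange(3.0, 6.0))},
    coords={"delta_time": [3, 4, 5]},
)

# Combine along the shared dimension after both loads complete
combined = xr.concat([ds_local, ds_s3], dim="delta_time")
```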

The options for what we could allow I see as:

1. Allow users to give lists of files so long as they are all s3 paths

2. Only allow users to give a single file if that file is on s3

I'm leaning towards the second one, just because it's barely practical for a user to open a variable from one file, let alone multiple. I could be swayed either way though. I'd love to hear additional opinions here.

@scottyhq gave this tutorial during the AGU CryoCloud workshop yesterday. At the very end there's an example of directly reading in some ATL06 data using h5coro. Not sure if there's currently a way to leverage this implementation without the xarray extension piece, but it's worth thinking about how this space might evolve (there's another effort working on creating kerchunk files for IS2 data) and what the implications are for how data is read in behind the scenes.

I'm thinking that users should be allowed to submit a list of s3 urls (option 1), but if there are more than, say, 2 files and/or more than 3 variables requested, they get a "this will take forever" warning (at a minimum), and we could consider requiring them to acknowledge it (e.g. require user input to proceed). For some reason I'm not feeling great about enforcing a one-s3-url limit, even if that's clearly the "right" thing to do to read in multiple files in a reasonable amount of time.
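A minimal sketch of that acknowledgment flow (hypothetical function name and thresholds; the real icepyx warning may differ):

```python
def confirm_large_request(n_files, n_vars, ask=input):
    """Warn and require confirmation when an s3 read will be very slow.

    Hypothetical sketch of the policy discussed above: more than 2 files
    and/or more than 3 variables triggers the warning. `ask` defaults to
    builtin input() but is injectable for testing.
    """
    if n_files <= 2 and n_vars <= 3:
        return True
    print(
        f"Reading {n_files} file(s) x {n_vars} variable(s) directly from "
        "s3 can take on the order of 5 minutes per variable per file."
    )
    return ask("Proceed anyway? [y/N] ").strip().lower() == "y"
```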

rwegener2 commented 10 months ago

I'm thinking that users should be allowed to submit a list of s3 urls (option 1), but if there's more than say 2 files and/or more than 3 variables requested they get a "it will take forever" warning (at a minimum) and we could consider requiring them to acknowledge it (e.g. require user input to proceed)

This makes sense to me as a path forward, @JessicaS11.

I looked at the tutorial of @scottyhq's you linked, and it looks like it accesses h5coro through sliderule. So to use this method we would have to use sliderule in icepyx, and that sounds a bit like a spaghetti of data-opening software.

It's good to know about all the other efforts going on. I am doubtful that icepyx will open data quickly until there is a change of file format/structure/kerchunk reference/similar effort. As for h5coro, @jpswinski has made a lot of helpful updates in the past few months, and we are meeting tomorrow to talk about moving the xarray backend forward. 🤞🏻 that can be an option soon.

rwegener2 commented 9 months ago

Adding as a todo:

review-notebook-app[bot] commented 9 months ago

Check out this pull request on ReviewNB

rwegener2 commented 9 months ago

Alright this PR is ready for re-review. Changes since the last review: