microsoft / PlanetaryComputerDataCatalog

Data catalog for the Microsoft Planetary Computer
https://planetarycomputer.microsoft.com
MIT License

access datasets through `fsspec` and `adlfs` #348

Closed ngam closed 2 years ago

ngam commented 2 years ago

Hi, thanks for providing these datasets. I would like to access the goes-cmi dataset through fsspec and adlfs (i.e. pip install fsspec adlfs), but I cannot seem to figure it out.

It seems the account_name associated with these datasets is pcstacitems? That doesn't seem to be documented. In any case, I cannot get straightforward anonymous access to any dataset. For example, this code

import fsspec
fs = fsspec.filesystem("abfs", account_name="pcstacitems")
fs.ls("/")  # or fs.ls("abfs://items/" and so on

errors with

ErrorCode:NoAuthenticationInformation
Content: <?xml version="1.0" encoding="utf-8"?><Error><Code>NoAuthenticationInformation</Code><Message>Server failed to authenticate the request. Please refer to the information in the www-authenticate header.

Or sometimes with things like "not found". Is there an easy way to access these datasets through fsspec? I also tried requesting a SAS token, but to no avail.

I also tried to figure out the details from this item, without success: https://planetarycomputer.microsoft.com/api/stac/v1/collections/goes-cmi/items/OR_ABI-L2-F-M6_G17_s20222200300319

For example, this also didn't work.

import fsspec
fs = fsspec.filesystem("abfs", account_name="goeseuwest")
fs.ls("noaa-goes-cogs/goes-17")

It would be nice if these datasets could be provided more transparently. I understand the desire to make them easy to use via Jupyter notebooks (like the examples) but I found those to be extremely hard for proper application/deployment (beyond the simple examples). For comparison, the GOES datasets could be easily used on AWS and GCP without any issue:

import fsspec
gcp = fsspec.filesystem("gs", token="anon")
gcp.ls("gcp-public-data-goes-17/")  # prints ['gcp-public-data-goes-17/ABI-L1b-RadC'...
aws = fsspec.filesystem("s3", anon=True)
aws.ls("noaa-goes17/")  # prints ['noaa-goes17/ABI-L1b-RadC', 'no...

Now trying:

az = fsspec.filesystem("abfs", token="anon", anon=True)

results in this error:


ValueError: Must provide either a connection_string or account_name with credentials!!

Thank you!

ngam commented 2 years ago

Okay, this actually works:

>>> az = fsspec.filesystem("abfs", account_name="goeseuwest")
>>> az.ls("noaa-goes17/")  # prints: ['noaa-goes17/ABI-L2-CMIPC',...

TomAugspurger commented 2 years ago

A couple questions:

  1. Did you see https://planetarycomputer.microsoft.com/dataset/goes-cmi#Example-Notebook?
  2. Are you trying to access the COGs or the NetCDF files?

The NetCDF files are in a public storage container, and so can be accessed as you showed. The COGs can also be accessed using fsspec / adlfs, but they're in a private storage container, so you need a short-lived SAS token, as shown in that example notebook and explained in https://planetarycomputer.microsoft.com/docs/concepts/sas/. I'll update https://planetarycomputer.microsoft.com/docs/concepts/sas/ to show an example with fsspec, but it would be

import planetary_computer
import fsspec

token = planetary_computer.sas.get_token("goeseuwest", "noaa-goes-cogs").token  # or use requests
fs = fsspec.filesystem("abfs", account_name="goeseuwest", credential=token)

Our STAC collections all include the storage information under `msft:storage_account` and `msft:container`.
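
For example, something along these lines (a quick, untested sketch) reads those fields for `goes-cmi` straight from the STAC API:

import requests

collection = requests.get(
    "https://planetarycomputer.microsoft.com/api/stac/v1/collections/goes-cmi"
).json()
# the storage account / container hosting this collection's assets
print(collection["msft:storage_account"], collection["msft:container"])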

TomAugspurger commented 2 years ago

I understand the desire to make them easy to use via Jupyter notebooks (like the examples) but I found those to be extremely hard for proper application/deployment (beyond the simple examples).

I'd be curious to hear more about what you found challenging about the workflow presented in the notebook, if you're able to share. We have users accessing the data via STAC in production / application environments too.

ngam commented 2 years ago

A couple questions:

Sure. I previously tried engaging with this project and I am happy to help and be a productive user/contributor, though I wish you could be a little more open to improvement requests (especially in the containers/images repo). The containers are quite out of date, especially the tensorflow one. I personally work really hard on improving conda-forge's offerings of tensorflow and pytorch (as well as a CUDA build of jaxlib, etc.) so that projects like this (and pangeo) can benefit from them.

  1. Did you see https://planetarycomputer.microsoft.com/dataset/goes-cmi#Example-Notebook?

Yes, I have looked at this many times (see below for the fuller answer on what I find challenging and draining about the example workflows)

  2. Are you trying to access the COGs or the NetCDF files?

I am only trying to access NetCDF files for now. I wish I could make enough progress to make use of COGs, but...

**

I'd be curious to hear more about what you found challenging about the workflow presented in the notebook, if you're able to share. We have users accessing the data via STAC in production / application environments too.

I am (supposed to be) a scientist (believe me) and I find arrays quite straightforward to work with (np.array, tf.Tensor, whatever --- they're all much the same). So when I want to interact with numerical objects, I always try to turn them into arrays as soon as I can. If I want to look at the GOES-17 datasets, the easiest thing for me is to just load an array and then deal with it with my own tools, the way I want; it is just a new array replacing an older array.

By comparison, the workflows you present are really creative and useful database-centric solutions (or "query" approaches) --- and I find that super frustrating! Why? Let's look at this example.

bbox = [-67.2729, 25.6000, -61.7999, 27.5423]
search = catalog.search(
    collections=["goes-cmi"],
    bbox=bbox,
    datetime=["2018-09-11T13:00:00Z", "2018-09-11T15:40:00Z"],
    limit=1,
    query={"goes:image-type": {"eq": "MESOSCALE"}},
)

In order for me to use this, I need to understand what this bbox representation is. I know it sounds weird coming from someone who has worked with satellite data before, but I spent hours and I couldn't really get my head around this bbox standard --- what if I want the whole image? What if I simply want this to be directly mapped onto the lon and lat provided by the netCDF4 files? What if I want this to be mapped onto the gridded points? Funnily enough, the plot presented in the notebook above uses the GOES x and y coordinates instead of whatever these bbox coordinates are (I know, they're lons and lats, but they're organized in a weird way).

One could ask similar questions about the limit and the query, etc., but you get the point: it forces the user to learn new tools and new software just to use your project, and that is the definition of an entry/exclusion barrier. I (really) sympathize with your approach, because I think you made an educated guess about what would be easiest/best for users who may know nothing about arrays, but a balance must be struck between ease for users who know nothing and users who are comfortable with numpy arrays and so on.

**

For reference, I am trying to assemble a data pipeline that roughly has the following criteria: sampled 4x daily (00, 06, 12, 18 hrs), as many as available per hour (maybe 10), for at least an entire year (so at least (4)(10)(365) images).

Would something like below work?

bbox = [-180, -90, 180, 90]  # the whole globe, or whatever the predetermined bounds are
search = catalog.search(
    collections=["goes-cmi"],
    bbox=bbox,
    datetime=["2020-01-01T00:00:00Z", "2021-01-01T00:00:00Z"],
    limit=1,  # how does this have to change?
    query={"goes:image-type": {"eq": "MESOSCALE"}},  # what would be the "CONUS" equivalent here?
)

When I spent a lot of time on this previously, I just couldn't even get past the bbox, so I gave up. Instead, what I ended up doing was basically downloading TBs of data locally and then relying on the file names and the time variable provided by the netCDF4 variables/keys to assemble my data pipeline, and it worked very nicely. Now I am trying to expand the project, and I found fsspec coupled with h5py to be the most intuitive and easiest approach, because I could easily transfer my code from one provider to another, including even "local":

fs = fsspec.filesystem("abfs", ...)
fs = fsspec.filesystem("gs", ...)
fs = fsspec.filesystem("s3", ...)
fs = fsspec.filesystem("file", ...) # for local files

Two random notes/questions maybe you could help me with:

1. I previously tried using the xarray package and I found it significantly slower than the plain netCDF4 locally, so I stopped using it even though xarray is way more convenient to use... Am I missing/messing something here?
2. You mention COGs and I believe I read that they can be really efficient and nice to use: Could you please provide me with a reference so that I can try to use them to see if they can fit nicely in my workflow?

**

I know this sounds like a rant, and it is, but please know that my willingness to spend time writing this long response and continuing to try to make use of this project is a sign of admiration for this work. Thank you for this wonderful resource and for your time, commitment, and engagement. My critique comes from a place of love and admiration, not scorn, and I really want this project to be even better.

TomAugspurger commented 2 years ago

Thanks for the feedback!

(especially in the containers/images repo).

I assume you're referring to https://github.com/microsoft/planetary-computer-containers/pull/38 and https://github.com/microsoft/planetary-computer-containers/pull/39. I am indeed happy to review those. You happened to submit those at a time when my attention was elsewhere. I don't get to spend as much time on maintaining the Hub as I'd like, since we have so many priorities. Though if you do reopen those I'll need a bit of time before I'm able to review them since I'm in the middle of preparing an AKS cluster migration for our Hub. I'd love to have an up-to-date tensorflow / jax image.

Did you see https://planetarycomputer.microsoft.com/dataset/goes-cmi#Example-Notebook?

Yes, I have looked at this many times (see below for the fuller answer on what I find challenging and draining about the example workflows)

Yeah, sorry, I didn't see that you'd referenced that in your original post, which led to my follow-up post. Apologies for missing that.

It sounds like our Introduction to STAC could use some improvement / additional cross-linking. (In this case, we're implementing the STAC API and will return any items intersecting with the user-provided bbox; if you omit the bbox, you get every item matching the other filters. I'll update the examples to make that clearer.)

Funnily enough, the plot presented in the notebook above uses the GOES x and y coordinates instead of whatever these bbox coordinates are (I know, they're lons and lats, but they're organized in a weird way).

Yes, that's a STAC detail that could be better explained: STAC items are GeoJSON features and so the bbox and geometry are always in WGS84 latitude / longitude. The assets themselves can be in whatever native projection they need (and the properties under the proj field capture this). I'll add that to the STAC introduction notebook. Thanks for pointing it out as a source of confusion.
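
As a rough, untested sketch of what that looks like in practice (the exact proj:* fields vary by collection, hence the .get calls):

import pystac_client

catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
item = next(catalog.search(collections=["goes-cmi"], limit=1).items())  # or .get_items() on older pystac-client
print(item.bbox)                            # always WGS84 [west, south, east, north]
print(item.properties.get("proj:shape"))    # native grid details, if the item sets proj:* fields
print(item.properties.get("proj:wkt2"))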

but a balance must be struck between ease for users who know nothing and users who are comfortable with numpy arrays and so on.

Indeed, we do find STAC useful. But we are also happy that the raw assets are available directly from blob storage. If you're more comfortable with that workflow, and the STAC API isn't providing anything useful, then you're free to go straight to the assets in blob storage using adlfs / azure.storage.blob / REST APIs. Regardless of whether or not you're going through STAC, you're still accessing the same raw ndarrays from Azure Blob Storage.

I'll update the introduction quickstarts to share a direct to blob storage example.
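
In the meantime, something roughly like this with azure.storage.blob should work against the public NetCDF container (untested sketch; the prefix is just an example):

from azure.storage.blob import ContainerClient

container = ContainerClient(
    "https://goeseuwest.blob.core.windows.net", "noaa-goes17", credential=None
)
for blob in container.list_blobs(name_starts_with="ABI-L2-CMIPC/2020/001/00/"):
    print(blob.name)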

Instead, what I ended up doing was basically downloading TBs of data locally and then relying on the file names and the time variable provided by the netCDF4 variables/keys to assemble my data pipeline, and it worked very nicely.

In general, we recommend against this type of workflow and recommend a pangeo-style workflow that streams data directly into memory. That said, if you have the local storage and are bottlenecked by network bandwidth (which I guess can be common for deep learning workflows?), it might make sense to start your pipeline by downloading data locally. Fortunately, you have that option since you have direct access to the files in blob storage.

This does highlight one of the reasons we (as a data provider) like STAC so much. Since we're hosting many different datasets, STAC gives us a standardized way to express the metadata that you're extracting from the filenames. Other datasets use different naming conventions, so you'll need custom code to parse the paths of each dataset you're using. But if you're only using one or a few datasets, that's not a big deal.
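
To make that concrete, here's the kind of parsing you end up writing against the NOAA naming convention (a rough sketch; the filename below is invented for illustration), versus simply reading item.datetime from STAC:

from datetime import datetime

# hypothetical CONUS channel-1 filename following the NOAA convention
name = "OR_ABI-L2-CMIPC-M6C01_G17_s20200010001177_e20200010003550_c20200010004046.nc"
start_token = name.split("_")[3]            # "s" + YYYYDDDHHMMSS + tenths of a second
start = datetime.strptime(start_token[1:-1], "%Y%j%H%M%S")
print(start)                                # 2020-01-01 00:01:17
# with STAC, the acquisition time is just item.datetime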

Now I am trying to expand the project, and I found fsspec coupled with h5py to be the most intuitive and easiest approach, because I could easily transfer my code from one provider to another, including even "local":

Just a note, depending on the files you might see pretty poor performance when opening the metadata from NetCDF files from Blob Storage (https://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud). But perhaps that's not an issue for your workflow. Either way, if you're reading all of the data from a NetCDF file anyway, it'll be faster to download it locally before opening it. You'll be accessing the same number of bytes, but doing it in fewer HTTP requests.
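
For example, a rough sketch of the download-first pattern (paths here are illustrative):

import fsspec
import netCDF4

fs = fsspec.filesystem("abfs", account_name="goeseuwest")
path = fs.glob("noaa-goes17/ABI-L2-CMIPC/2020/001/00/*M6*C01*")[0]
fs.get(path, "local.nc")               # one bulk download instead of many small range reads
ds = netCDF4.Dataset("local.nc")
print(ds.variables["CMI"].shape)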

I previously tried using the xarray package and I found it significantly slower than the plain netCDF4 locally, so I stopped using it even though xarray is way more convenient to use... Am I missing/messing something here?

It's a bit hard to say, but if you're accessing the NetCDF files from Blob Storage then my previous note about HDF in the cloud might be the culprit. xarray tends to do a bit more work, and might load a bit more metadata, than h5py.File or netCDF4 do when opening a file. https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html documents a few keywords that should get xarray pretty close to the underlying engine though: decode_cf, decode_times, use_cftime, etc.
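
Something like this (untested) should make the comparison closer to apples-to-apples:

import xarray as xr

# decode_cf=False turns off masking/scaling, time decoding, etc., which is
# close to what raw netCDF4/h5py give you
ds = xr.open_dataset("local.nc", engine="netcdf4", decode_cf=False)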

You mention COGs and I believe I read that they can be really efficient and nice to use: Could you please provide me with a reference so that I can try to use them to see if they can fit nicely in my workflow?

https://planetarycomputer.microsoft.com/dataset/goes-cmi#Example-Notebook and https://github.com/microsoft/PlanetaryComputerExamples/blob/main/tutorials/hurricane-florence-animation.ipynb, which use STAC, hopefully give a decent overview.

Unfortunately, I ran into https://github.com/opendatacube/odc-stac/issues/85, so I'm still debugging that. I'll get back to you on a more complete (possible) workflow. That said, if you have something that works using Blob Storage directly, then by all means use that!

One question about this:

sampled 4x daily (00, 06, 12, 18 hrs), as many as available per hour (maybe 10), for at least an entire year (so at least (4)(10)(365) images)

Which GOES image types are you interested in? There was a comment about "CONUS". And do you want a single ndarray with assets from both GOES-16 and 17, or would those be in two separate ndarrays?

It looks like there are 6 CONUS assets per hour:

[screenshot: a table of the CONUS assets for a single hour]
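
If you want to double-check that against the API yourself, a quick count along these lines should work (untested sketch):

import pystac_client

catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
search = catalog.search(
    collections=["goes-cmi"],
    datetime="2020-01-01T00:00:00Z/2020-01-01T01:00:00Z",
    query={"goes:image-type": {"eq": "CONUS"}},
)
print(len(list(search.items())))   # CONUS items from both GOES-16 and GOES-17 in that hour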

Thanks again for the critical feedback, I really do appreciate it. And let me know if I missed anything. I'm going through all of our notebooks following some changes in https://github.com/stac-utils/pystac-client, so I'll include some of your suggestions here:

  1. Show examples using fsspec / adlfs directly
  2. Cover some of the STAC API fundamentals like limit, coordinate systems, query, intersections, etc.
ngam commented 2 years ago

Which GOES image types are you interested in? There was a comment about "CONUS". And do you want a single ndarray with assets from both GOES-16 and 17, or would those be in two separate ndarrays?

I have been only looking at GOES-17 (and from that really only "CONUS" GOES-17, which is the stuff over the Pacific --- the science/physics question I am after has to do with clouds over the Pacific). And no need to merge them or anything --- just one ndarray per image. Yes, the formal number I think is 6, but I am actually not sure; I can look into it. (I threw the number 10 out there because I knew it'd be on the order of 10.)

Now, to give you even more background: we have these 16 or so channels from GOES-17, right? I put these (usually one or three) into a vision model, so each channel can be separate or they can be concatenated (it is fine if that is done afterwards). The timing here (4x daily: 00, 06, 12, 18) is because the other datasets I am coupling with the imagery are at most 4x daily at these hours. But once the early stages of this project are working effectively, I will move to more varied and potentially higher-frequency data (though these may prove useless for the timescales I am interested in: atmospheric clouds...)

--

Just in general:

Please feel free to use the community to help you out! This is the whole point of this being open source. For example, if you open a PR with that pandas code snippet, I can help make it into a tutorial or something. I can also resubmit the PRs to update the docker containers soon.

  1. Show examples using fsspec / adlfs directly
  2. Cover some of the STAC API fundamentals like limit, coordinate systems, query, intersections, etc.

How about you initiate a PR or two, and I can review and/or help cover the view from a scientist? Please feel free to tag me and I will be there to contribute.

--

I am saving your response as a resource and I will be coming back to it as I move this project forward. Thank you so much!!!

ngam commented 2 years ago

For the record, I am amazed by the tech and details of all of this. I am by no means discouraging this innovative way of doing things... I just think it could be slightly clearer and more intuitive, and that's my principal goal in offering to help and spending time giving feedback. I have a lot of resources on-premise (hence downloading and effectively using TBs of data), but I know that's a losing game and I want to be able to use these resources effectively. So I am excited to learn how to use the STAC API more effectively going forward!

ngam commented 2 years ago

It looks like there are 6 CONUS assets per hour:

Hmmm...

>>> az = fsspec.filesystem("abfs", account_name="goeseuwest")
>>> len(az.glob("noaa-goes17/ABI-L2-CMIPC/2020/001/00/*M6*C01*"))
12
>>> len(az.glob("noaa-goes17/ABI-L2-CMIPF/2020/001/00/*M6*C01*"))
6

I believe the "F" is for full disk and "C" is for CONUS. I usually work with ABI-L1 data, but these are not available on Azure. Fortunately, I believe getting the L2 "CMI" like this is very close to the original L1 data. I just need to figure out the necessary corrections for channels (for the vis chs, 01 through 03, I think one only needs kappa0) but I need to double check

TomAugspurger commented 2 years ago

Thanks. I’ve been out sick the past couple days but I’ll take a look when I’m back.

ngam commented 2 years ago

Feel better and please take your time :)