intake / intake-stac

Intake interface to STAC data catalogs
https://intake-stac.readthedocs.io/en/latest/
BSD 2-Clause "Simplified" License

support parquet files in catalog #50

Open christine-e-smit opened 4 years ago

christine-e-smit commented 4 years ago

As I said in #48, I was recently involved in a group trying to use intake-stac with some data we have sitting in S3. This data is in parquet format. I've used intake-parquet on this data with no problem to get a dask dataframe. But when I try with intake-stac,

import intake
from intake import open_stac_catalog
cat = open_stac_catalog('https://not.the.real.url/catalog.json')
df = cat["AIRX3STD_006_SurfAirTemp_A/AIRX3STD_006_SurfAirTemp_A_2002_08.parquet"].get().to_dask()

I get the error:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-10-25d227182f13> in <module>
----> 1 df = cat["AIRX3STD_006_SurfAirTemp_A/AIRX3STD_006_SurfAirTemp_A_2002_08.parquet"].get().to_dask()

~/Software/python3/envs/stac/lib/python3.8/site-packages/intake/source/base.py in to_dask(self)
    219     def to_dask(self):
    220         """Return a dask container for this data source"""
--> 221         raise NotImplementedError
    222 
    223     def to_spark(self):

NotImplementedError: 

I assume that intake-stac is keying off the "type" field in the item field. Parquet doesn't have a mime-type, so I tried 'parquet' without success. I then re-read your Readme and realized that if intake-stac is built on top of intake-xarray, then you probably can't read in parquet regardless of what I put in the "type" field.

Would it be possible to add parquet via the intake-parquet library?

I'm wondering if parquet is beyond the scope of the STAC catalog spec? I don't see parquet in STAC's list of media types here. But then I don't see zarr either and I'm guessing that you support zarr with intake-stac because it's your favored data type for pangeo.

jhamman commented 4 years ago

Yes! We should totally be able to do this. We need to map the stac type to the intake-parquet driver. Here's where that would go:

https://github.com/intake/intake-stac/blob/d71b2d2b0ea2f8c89cb0310706c4de6d19406e17/intake_stac/catalog.py#L351-L363
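To make the idea concrete, here is a minimal sketch of such a lookup, not the actual intake-stac code. The table name `MEDIA_TYPE_TO_DRIVER`, the helper `pick_driver`, and the example bucket are hypothetical, and `application/parquet` is only the media type floated in this thread, not a registered MIME type. The driver names are assumed to be the ones registered by intake-xarray and intake-parquet.

```python
# Hypothetical lookup table; the real mapping lives in intake_stac/catalog.py
# and may differ from what is shown here.
MEDIA_TYPE_TO_DRIVER = {
    "application/x-netcdf": "netcdf",
    "image/tiff; application=geotiff": "rasterio",
    # Proposed in this thread; parquet has no registered MIME type here.
    "application/parquet": "parquet",
}


def pick_driver(asset, default=None):
    """Return the intake driver name for a STAC asset dict, or `default`.

    `asset` is a plain dict shaped like a STAC asset ("href", "type", ...).
    """
    media_type = (asset.get("type") or "").strip().lower()
    return MEDIA_TYPE_TO_DRIVER.get(media_type, default)


# Illustrative asset; the href is a placeholder, not a real location.
asset = {
    "href": "s3://example-bucket/AIRX3STD_006_SurfAirTemp_A_2002_08.parquet",
    "type": "application/parquet",
}
print(pick_driver(asset))
```

An asset with no "type" field falls through to `default`, which leaves the caller free to raise a clearer error than a bare `NotImplementedError`.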

Are you up for adding this feature?

christine-e-smit commented 4 years ago

I think I can handle adding one line to your drivers :)

But I'd think this would also require adding intake-parquet as a dependency somewhere. Your top-level requirements.txt, I assume?

And I'd need to add something to https://github.com/intake/intake-stac/blob/d71b2d2b0ea2f8c89cb0310706c4de6d19406e17/intake_stac/tests/test_catalog.py

wildintellect commented 4 years ago

I was looking over this during STAC sprint 6 and am currently updating the types based on STAC Types.

  1. What Media Type should we use for parquet, considering it does not have a MIME type? One idea: application/parquet
  2. While we're at it, should we add all the Media Types that STAC supports? Maybe this is a different ticket, to figure out additional formats intake needs, like geojson.

jhamman commented 4 years ago

@wildintellect - if you are up for it, let's just do one PR where we update all the media types. I think application/parquet makes sense. I can help provide additional mappings to intake drivers as needed.