support parquet files in catalog

christine-e-smit commented 4 years ago

As I said in #48, I was recently involved in a group trying to use intake-stac with some data we have sitting in S3. This data is in parquet format. I've used intake-parquet on this data with no problem to get a dask data frame. But when I try with intake-stac,

import intake
from intake import open_stac_catalog
cat = open_stac_catalog('https://not.the.real.url/catalog.json')
df = cat["AIRX3STD_006_SurfAirTemp_A/AIRX3STD_006_SurfAirTemp_A_2002_08.parquet"].get().to_dask()

I get the error:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-10-25d227182f13> in <module>
----> 1 df = cat["AIRX3STD_006_SurfAirTemp_A/AIRX3STD_006_SurfAirTemp_A_2002_08.parquet"].get().to_dask()

~/Software/python3/envs/stac/lib/python3.8/site-packages/intake/source/base.py in to_dask(self)
    219     def to_dask(self):
    220         """Return a dask container for this data source"""
--> 221         raise NotImplementedError
    222 
    223     def to_spark(self):

NotImplementedError:

I assume that intake-stac is keying off the "type" field in the item field. Parquet doesn't have a MIME type, so I tried 'parquet' without success. I then re-read your README and realized that if intake-stac is built on top of intake-xarray, then you probably can't read in parquet regardless of what I put in the "type" field.

Would it be possible to add parquet via the intake-parquet library?

I'm wondering if parquet is beyond the scope of the STAC catalog spec? I don't see parquet in STAC's list of media types here. But then I don't see zarr either and I'm guessing that you support zarr with intake-stac because it's your favored data type for pangeo.
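Until a driver mapping exists, one possible workaround is to pull the asset URLs out of the STAC item and hand them to intake-parquet directly. The sketch below is only an illustration: `parquet_hrefs` is a hypothetical helper, the item is assumed to be a plain dict with the standard STAC `assets` layout (asset key mapped to an object with `href` and optional `type`), and the final commented-out call assumes intake-parquet is installed so that `intake.open_parquet` is available.

```python
def parquet_hrefs(item: dict) -> list:
    """Collect hrefs of assets that look like parquet files.

    Hypothetical helper: an asset counts as parquet if its href ends in
    ".parquet" or its "type" field mentions parquet.
    """
    hrefs = []
    for asset in item.get("assets", {}).values():
        href = asset.get("href", "")
        media_type = asset.get("type") or ""
        if href.endswith(".parquet") or "parquet" in media_type.lower():
            hrefs.append(href)
    return hrefs


# Made-up item for illustration; the bucket and file names are not real.
example_item = {
    "assets": {
        "data": {"href": "s3://example-bucket/example_2002_08.parquet"},
        "thumbnail": {"href": "s3://example-bucket/thumb.png", "type": "image/png"},
    }
}

urls = parquet_hrefs(example_item)

# Assumption: with intake-parquet installed, each URL could then be opened as
#   import intake
#   df = intake.open_parquet(urls[0]).to_dask()
```

This bypasses intake-stac's own driver selection entirely, so it is a stopgap rather than a fix.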

jhamman commented 4 years ago

Yes! We should totally be able to do this. We need to map the stac type to the intake-parquet driver. Here's where that would go:

https://github.com/intake/intake-stac/blob/d71b2d2b0ea2f8c89cb0310706c4de6d19406e17/intake_stac/catalog.py#L351-L363
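Roughly, the change amounts to one more entry in a lookup from media type to intake driver name. The following is a minimal sketch, not the actual code at that link: `DRIVERS` and `get_driver` are hypothetical names, the existing entries are illustrative, and the parquet media type is a placeholder since parquet has no registered one.

```python
# Hypothetical sketch of a media-type -> intake driver lookup.
# This is NOT the intake-stac source; names and entries are illustrative.
DRIVERS = {
    "application/netcdf": "netcdf",               # intake-xarray driver
    "image/tiff; application=geotiff": "rasterio",  # intake-xarray driver
    "application/parquet": "parquet",             # intake-parquet driver; placeholder media type
}


def get_driver(asset: dict):
    """Return the intake driver name for a STAC asset dict, or None if unmapped."""
    media_type = (asset.get("type") or "").strip().lower()
    return DRIVERS.get(media_type)


print(get_driver({"href": "s3://example-bucket/data.parquet", "type": "application/parquet"}))
```

Returning `None` for unknown types is a design choice for this sketch; the real catalog code may fall back to a default driver or raise instead.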

Are you up for adding this feature?

christine-e-smit commented 4 years ago

I think I can handle adding one line to your drivers :)

But I'd think this would also require adding intake-parquet as a dependency somewhere. Your top level requirements.txt, I assume?

And I'd need to add something to https://github.com/intake/intake-stac/blob/d71b2d2b0ea2f8c89cb0310706c4de6d19406e17/intake_stac/tests/test_catalog.py

wildintellect commented 4 years ago

Was looking over this during STAC sprint 6, currently updating types based on STAC Types

What Media Type should we use for parquet, considering it does not have a MIME type? One idea: application/parquet
While we're at it, should we add all the Media Types that STAC supports? Maybe figuring out the additional formats intake needs, like geojson, belongs in a different ticket.

jhamman commented 4 years ago

@wildintellect - if you are up for it, let's just do one PR where we update all the media types. I think application/parquet makes sense. I can help provide additional mappings to intake drivers as needed.

intake / intake-stac

support parquet files in catalog #50