pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.
https://pangeo-forge.readthedocs.io/
Apache License 2.0

CSV to Parquet recipe #94

Open rabernat opened 3 years ago

rabernat commented 3 years ago

So far we basically only have NetCDF (or other formats that Xarray can read, e.g. GRIB) to Zarr recipes.

Some recipes will want to work with tabular data, e.g. transforming a collection of CSVs to Parquet. (Example: https://github.com/pangeo-forge/staged-recipes/issues/3)

This will require an entirely new recipe class. Creating this class will force us to refactor the recipe module significantly. This will be laborious but hopefully relatively straightforward.

cisaacstern commented 2 years ago

I'm sitting here with @einatlev-ldeo at the EarthCube Annual Meeting in La Jolla. We are discussing if/how we may be able to provide cloud-optimized access to (at least some subset of) the data provided on

via Pangeo Forge.

Based on our discussions, it seems that this may be a great use case for a Parquet recipe. It strikes me that once we complete the work scoped in https://github.com/pangeo-forge/pangeo-forge-recipes/issues/376, writing a Parquet recipe may be quite approachable (really just a few additional PTransforms).

While we're waiting for the first phase of the Beam work to complete, perhaps we can start brainstorming what data objects would make sense to assemble from these raw data. For example, is there a set (or sets) of variables with the same time resolution that could all fit together in a single large table? If so, what are those variables and their access paths on the file server? Can we assemble a demonstration CSV from them using a simple standalone Python script? If so, that would be a very useful basis for building a larger table with Pangeo Forge.
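
To make the "simple standalone Python script" idea concrete, here is a minimal sketch using only the standard library. It assumes each variable lives in its own CSV with a shared `time` column holding ISO 8601 strings. The file names, column names, and values are made up for illustration and are not taken from the actual file server, and `merge_csvs_on_time` is a hypothetical helper, not part of pangeo-forge-recipes.

```python
import csv
import tempfile
from pathlib import Path


def merge_csvs_on_time(paths, out_path, time_col="time"):
    """Join per-variable CSVs that share a time column into one wide table.

    Only timestamps present in every input file are kept (an inner join).
    Returns the number of data rows written.
    """
    paths = list(paths)
    if not paths:
        raise ValueError("need at least one input CSV")

    tables = []   # one {timestamp: {column: value}} mapping per input file
    columns = []  # value columns, in the order the files were given
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            value_cols = [c for c in reader.fieldnames if c != time_col]
            columns.extend(value_cols)
            tables.append(
                {row[time_col]: {c: row[c] for c in value_cols} for row in reader}
            )

    # Timestamps common to all files; lexical sort is fine for ISO 8601 strings.
    shared = sorted(set.intersection(*(set(t) for t in tables)))

    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=[time_col] + columns)
        writer.writeheader()
        for ts in shared:
            row = {time_col: ts}
            for table in tables:
                row.update(table[ts])
            writer.writerow(row)
    return len(shared)


# Demo with two tiny, made-up files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "temperature.csv").write_text(
        "time,temperature_c\n"
        "2022-06-01T00:00,11.2\n"
        "2022-06-01T01:00,11.0\n"
        "2022-06-01T02:00,10.7\n"
    )
    (tmp / "pressure.csv").write_text(
        "time,pressure_hpa\n"
        "2022-06-01T01:00,1012.4\n"
        "2022-06-01T02:00,1012.1\n"
        "2022-06-01T03:00,1011.9\n"
    )
    n_rows = merge_csvs_on_time(
        [tmp / "temperature.csv", tmp / "pressure.csv"], tmp / "merged.csv"
    )
    merged_text = (tmp / "merged.csv").read_text()

print(merged_text)
```

In this sketch the inner join drops any timestamp missing from one of the inputs. Whether that is the right behavior (versus keeping gaps as empty cells) depends on the real data, so treat it as a starting point for discussion.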

Side note: there's some awesome webcam data available through the same project. I wonder what ARCO format might be suitable for webcam time series data?

TomAugspurger commented 2 years ago

Just FYI, I have some notes on how we think about tabular data for the Planetary Computer: https://gist.github.com/TomAugspurger/457a2288f6ef7490ab87546faf665e14

cisaacstern commented 2 years ago

Thanks Tom, this is great!

einatlev-ldeo commented 2 years ago

Thank you!
