hytest-org / hytest

https://hytest-org.github.io/hytest/

Cloud file formats #24

Closed by alaws-USGS 2 years ago

alaws-USGS commented 2 years ago

TITLE ADR 6: Cloud file formats

CONTEXT Data that must be cloud accessible, and that will be processed for use in these tutorials, needs an agreed set of file formats.

DECISION Data will be stored in the following formats:

raster: NetCDF

point (vector): xarray

polygon (vector): geoparquet

STATUS Proposed

CONSEQUENCES Consistent data formats will make input/output in the notebooks easier and will give clear guidance during data acquisition.

From Rich:

For Python users, the format is not that important, since we have the kerchunk/fsspec ReferenceFileSystem to unify access to any collection of scientific data formats. It's the chunk size and shape that determine performance on the cloud.

But for non-Python users, it's convenient to use formats that they can read!

So yes, I would say the formats might be:

n-dimensional array data from sensors and models: NetCDF4 with chunk sizes between 10 and 200 MB, with appropriate chunk shapes (fast for displaying maps at a specific time, not terribly slow for extracting time series at a point)

raster data: either NetCDF4 or Cloud-Optimized GeoTIFF

vector data: for now geodatabase, soon geoparquet

tabular data: NetCDF4 or parquet
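The chunk-size guidance for n-dimensional arrays can be checked with simple arithmetic. The sketch below is illustrative only: the helper names, the float32 element size, and the three example chunk shapes are assumptions, not values from this thread, and it assumes the 10-200 MB range refers to uncompressed chunk size.

```python
from math import prod

# Target range from the comment above, read here as uncompressed
# megabytes (10**6 bytes). That reading is an assumption.
MIN_CHUNK_MB = 10
MAX_CHUNK_MB = 200


def chunk_size_mb(chunk_shape, itemsize_bytes):
    """Return the uncompressed size of one chunk in megabytes."""
    return prod(chunk_shape) * itemsize_bytes / 1e6


def in_target_range(chunk_shape, itemsize_bytes):
    """True if the uncompressed chunk size falls within the target range."""
    size = chunk_size_mb(chunk_shape, itemsize_bytes)
    return MIN_CHUNK_MB <= size <= MAX_CHUNK_MB


# Hypothetical float32 (4-byte) variable with dimensions (time, y, x).
map_friendly = (1, 2000, 2000)    # one time step, large spatial tile
balanced = (24, 500, 500)         # some time depth, smaller tile
series_friendly = (8760, 10, 10)  # long time axis, tiny tile

for name, shape in [
    ("map_friendly", map_friendly),
    ("balanced", balanced),
    ("series_friendly", series_friendly),
]:
    print(name, shape, chunk_size_mb(shape, 4), in_target_range(shape, 4))
```

With these hypothetical shapes, the first two chunks land inside the range and the third is too small. A shape that keeps some depth along time while staying inside the range is one way to balance map display against time-series extraction.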

alaws-USGS commented 2 years ago

Cloud file formats

Date: 2022-08-25

Status

Accepted

Context

Data that must be cloud accessible, and that will be processed for use in these tutorials, needs an agreed set of file formats.

Decision

Data will be stored in the following formats:

n-dimensional array: NetCDF4 with chunk sizes between 10 and 200 MB

raster: NetCDF4 or Cloud-Optimized GeoTIFF

vector: geodatabase (now), geoparquet (future)

tabular: NetCDF4 or parquet
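As a rough illustration, the accepted formats could be recorded as a small lookup that a data-acquisition script consults. The dictionary name and the `is_accepted` helper are hypothetical and not part of the hytest repository; only the format names come from the decision.

```python
# Accepted formats per data kind, transcribed from the decision above.
ACCEPTED_FORMATS = {
    "n-dimensional array": ["NetCDF4"],
    "raster": ["NetCDF4", "Cloud-Optimized GeoTIFF"],
    "vector": ["geodatabase", "geoparquet"],  # geodatabase now, geoparquet in future
    "tabular": ["NetCDF4", "parquet"],
}


def is_accepted(kind, fmt):
    """Case-insensitive check that fmt is an accepted format for kind."""
    accepted = ACCEPTED_FORMATS.get(kind, [])
    return fmt.lower() in (name.lower() for name in accepted)


print(is_accepted("raster", "netcdf4"))    # format listed for raster
print(is_accepted("vector", "shapefile"))  # format not listed for vector
```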

Consequences

Consistent data formats will make input/output in the notebooks easier and will give clear guidance during data acquisition.