leap-stc / data-management

Collection of code to manually populate the persistent cloud bucket with data
https://catalog.leap.columbia.edu/
Apache License 2.0

New Dataset [Microsoft Subseasonal Data] #18

Closed AlexandreRebiere closed 8 months ago

AlexandreRebiere commented 1 year ago

Dataset Name

Microsoft Subseasonal Data

Dataset URL

https://github.com/microsoft/subseasonal_data/blob/main/DATA.md

Description

The SubseasonalClimateUSA dataset houses a diverse collection of ground-truth measurements and model forecasts relevant to forecasting at subseasonal timescales. It underlies another forecasting model whose performance we want to analyze.

Size

Total size of 175 GB

License

Unknown

Data Format

HDF

Data Format (other)

.h5 or .feather

Access protocol

HTTP(S)

Source File Organization

The dataset is organized as a collection of Python Pandas DataFrames and Series objects stored in HDF5 format (via pandas.DataFrame.to_hdf or pandas.Series.to_hdf) or feather format (via pandas.DataFrame.to_feather or pandas.Series.to_feather), with one .h5 or .feather file per DataFrame or Series.
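Since each file holds a single DataFrame or Series, loading comes down to picking the pandas reader that matches the file extension. The following is a minimal sketch of that idea, not the toolkit's own loader: the helper names `reader_for` and `load_frame` and the file name in the usage note are hypothetical, and it assumes PyTables (for `.h5`) and pyarrow (for `.feather`) are installed.

```python
from pathlib import Path

import pandas as pd

# Map each file extension named above to the matching pandas reader.
READERS = {".h5": pd.read_hdf, ".feather": pd.read_feather}


def reader_for(path):
    """Return the pandas reader for a .h5 or .feather path (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None


def load_frame(path):
    """Load the single DataFrame or Series stored in one file."""
    # pd.read_hdf needs no explicit key when the file holds exactly one object.
    return reader_for(path)(path)
```

For example, `load_frame("some_variable.h5")` (a placeholder file name) would return whatever object was written with `to_hdf`.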

Each HDF5 file contributes data variables (features or target values) falling into one of three categories: (i) spatial (varying with the target grid point but not the target date); (ii) temporal (varying with the target date but not the target grid point); (iii) spatiotemporal (varying with both the target grid point and the target date).

Unless otherwise noted below, temporal and spatiotemporal variables arising from daily data sources were derived by averaging input values over each 14-day period, and spatial and spatiotemporal variables were derived by interpolating input data to a 1° × 1° latitude-longitude grid using the Climate Data Operators operator remapdis (distance-weighted average interpolation) with target grid r360x181 and retaining only the grid points belonging to the contiguous United States.
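To make the 14-day averaging concrete, here is a small pandas sketch on synthetic daily values for one grid point. It assumes overlapping windows labelled by the first day of each 14-day period; the issue does not state the labelling convention, and the variable name `tmp2m` is only illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic daily values for a single grid point (illustrative data only).
dates = pd.date_range("2020-01-01", periods=28, freq="D")
daily = pd.Series(np.arange(28, dtype=float), index=dates, name="tmp2m")

window = 14
# rolling() labels each mean by the LAST day of its window, so shift the
# result back by window - 1 days to label it by the FIRST day (assumption).
two_week_mean = daily.rolling(window).mean().shift(-(window - 1)).dropna()

# The first entry is the mean of days 0..13; the last is the mean of days 14..27.
print(two_week_mean.head())
```

Only start dates with a full 14 days of data after them are kept, so 28 daily values yield 15 two-week means.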

Example URLs

https://github.com/microsoft/subseasonal_toolkit

Authorization

No; data are fully public

Transformation / Processing

We will work with the data without any modifications (at least initially).

Target Format

Zarr

Comments

No response

jbusecke commented 1 year ago

@AlexandreRebiere can you help me understand the details of this dataset? Is this data updated at regular intervals, or can it be considered 'fixed'?

AlexandreRebiere commented 1 year ago

For our internship, it is fine to treat this dataset as "fixed"; the model does not need the latest data. That said, handling this dataset is not urgent: we have a lot to do with rodeo_forecast first, so we may only get to this new dataset in a few weeks.