Provide free-usage example data

raehik commented 1 year ago

The data is currently hosted on Pangeo. This is not the training data as such: a pre-processing step sits between the hosted data and what the model is trained on. If someone wants to retrain, we need to provide guidance on how to obtain and process it. We don't necessarily need to move the data from where it is currently hosted.

The dataset referenced in cmip26.py is in a requester-pays Google Cloud bucket. Nice for the host, but extremely inconvenient for users (especially without an explanation -- gcloud's error message is very confusing).

Ideally, we provide access to example data to use with this software. There are two separate types of data that seem required:

Simple example data (small) for testing, to check that the software is working.
Training data (large): just provide guidance on how to obtain and process it.

raehik commented 1 year ago

Related: the process of obtaining GCP creds to use a requester-pays bucket isn't all that straightforward, even for a previous GCP user. Should we provide some notes on what to do there?
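
A rough sketch of what such notes could boil down to on the Python side. None of this is taken from the repo: the helper name `requester_pays_storage_options` is hypothetical, and it assumes the user already has a GCP project with billing enabled and has run `gcloud auth application-default login`. The returned keys (`project`, `requester_pays`, `token`) are meant to be passed to `gcsfs.GCSFileSystem`, either directly or via `storage_options`.

```python
import os


def requester_pays_storage_options(project=None):
    """Build keyword arguments for gcsfs.GCSFileSystem (hypothetical helper).

    Reading from a requester-pays bucket is billed to the caller, so a
    billing project ID has to be supplied one way or another.
    """
    # Fall back to the environment variable commonly used for the project ID.
    project = project or os.environ.get("GOOGLE_CLOUD_PROJECT")
    if not project:
        raise ValueError(
            "A billing project is required for requester-pays buckets: "
            "pass project=... or set GOOGLE_CLOUD_PROJECT."
        )
    return {
        "project": project,         # project that gets billed for the reads
        "requester_pays": True,     # send the billing project with each request
        "token": "google_default",  # use application-default credentials
    }


if __name__ == "__main__":
    # "my-billing-project" is a placeholder, not a real project ID.
    options = requester_pays_storage_options("my-billing-project")
    print(options)
    # With gcsfs and xarray installed, the options could then be used as:
    #   import xarray as xr
    #   ds = xr.open_zarr("gs://<bucket>/<path>", storage_options=options)
    # where <bucket>/<path> stands for wherever the dataset lives.
```

Failing early with a clear message when no billing project is set is the main point here, since that is exactly the situation where the error from the cloud side is hard to interpret.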

raehik commented 1 year ago

The GCP creds stuff was added in #60 -- worked for onboarding @MarionBWeinzierl.

raehik commented 1 year ago

Related: we want to place the processed data (the output of the data processing step) on Hugging Face. That'd be a nice first step. See #74.

raehik commented 1 year ago

Tentatively closed along with #74.

m2lines / gz21_ocean_momentum

Provide free-usage example data #21