openclimatefix / nowcasting_dataset

Prepare batches of data for training machine learning solar electricity nowcasting models
https://nowcasting-dataset.readthedocs.io/en/stable/
MIT License

Use independent processes for each "modality" #202

Closed JackKelly closed 2 years ago

JackKelly commented 2 years ago

This issue is split from #166

Detailed Description

We could have separate files for each data source, for each batch.

For example, on disk, within the prepared_ML_data/train/ directory, we might have train/NWP/, train/satellite/, etc. And, as before, in each of these folders, we'd have one file per batch, identified by the batch number. And, importantly, train/NWP/1.nc and train/satellite/1.nc would still be perfectly aligned in time and space (just as they currently are).
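Concretely, the layout described above might look like this (batch numbers and the `validation` split are illustrative; the issue only mentions `train`):

```
prepared_ML_data/
└── train/
    ├── NWP/
    │   ├── 0.nc
    │   ├── 1.nc
    │   └── ...
    └── satellite/
        ├── 0.nc
        ├── 1.nc
        └── ...
```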

Saving each "modality" as a different set of files opens up the possibility of further modularising and de-coupling nowcasting_dataset.

prepare_ml_data.py could run through each modality separately, something like:

  1. Randomly sample the "positions" in time and space for each ML training example, and save to disk. (In a little more detail: Find all the available t0_datetimes from across all the DataSources (see #204). Randomly sample from these; and randomly sample from the available locations... This should be general enough to enable #93)
  2. Fire up a separate process for each modality (probably using futures.ProcessPoolExecutor). We could even have multiple processes per modality, where each process works on a different subset of the "positions" (e.g. if we want 4 processes for each modality, then split the "positions" list into quarters).
  3. Each process will read from the previously-saved "positions", and save pre-prepared batches to disk for that modality.
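The three steps above could be sketched roughly as follows. This is a hypothetical illustration, not the repo's actual code: `prepare_modality`, `run_all_modalities`, and the shape of `positions` are all assumptions; the real worker would call `DataSource.prepare_batch(...)` and write NetCDF files.

```python
from concurrent.futures import ProcessPoolExecutor


def prepare_modality(args):
    """Worker process for one modality (step 3).

    In the real pipeline this would read the pre-saved "positions" from
    disk, then load source data and save one file per batch, e.g.
    train/<modality>/<batch_number>.nc.  Here it just counts positions.
    """
    modality, positions = args
    return modality, len(positions)


def run_all_modalities(modalities, positions):
    """Fire up a separate process for each modality (step 2)."""
    with ProcessPoolExecutor(max_workers=len(modalities)) as executor:
        results = executor.map(
            prepare_modality, [(m, positions) for m in modalities]
        )
    return dict(results)
```

Because every worker reads the same saved "positions", the batches written by each modality stay perfectly aligned in time and space, just as when a single process built them.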

By default, prepare_ml_data.py should create all modalities specified in the config yaml file. But the user should be able to pass in a command-line argument (#171) to only re-create one or a subset of modalities (e.g. if we fix a bug in the creation of batches of satellite data, and we only want to re-compute the satellite data).
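A minimal sketch of such a command-line option, assuming argparse; the `--modalities` flag name is illustrative, not the repo's actual CLI:

```python
import argparse


def parse_args(argv=None):
    """Parse the hypothetical modality-selection flag.

    By default (None) every modality in the config YAML is produced;
    passing e.g. `--modalities satellite` restricts the run to a subset.
    """
    parser = argparse.ArgumentParser(description="Prepare ML batches.")
    parser.add_argument(
        "--modalities",
        nargs="+",
        default=None,  # None means "all modalities in the config file"
        help="Only re-create these modalities, e.g. --modalities satellite",
    )
    return parser.parse_args(argv)
```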

Advantages:

Disadvantages:

Subtasks, in sequence:

  1. [ ] Pre-prepare the "plan" and save it to disk (before processing any data) (possibly do #204 at the same time, if it makes #202 easier). Then, load the plan from disk and proceed as the code currently works.
  2. [ ] Implement DataSource.prepare_batch(t0_datetimes, x_centers, y_centers, dst_path) which does everything :) It loads a batch from the source data, selects the appropriate times and spatial positions, and writes the batch to disk (this solves #212). prepare_ml_data.py will read the entire pre-prepared "plan", and fire up a process (using ProcessPoolExecutor()) for each modality.
  3. [ ] Remove the code that combines batches from each DataSource into a single batch
  4. [ ] Simplify the public interface to DataSource: now that we're not combining data from different modalities, the data never needs to leave the DataSource. You could imagine that each DataSource only needs to expose three public methods: get_available_t0_datetimes(history_minutes, forecast_minutes), sample_locations_for_datetimes(t0_datetimes), and prepare_batch(t0_datetimes, center_x, center_y)
  5. [ ] Remove any unused functions (and their tests).
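The slimmed-down public interface from subtask 4 might look like the following abstract base class. Method names are taken from the list above; the exact signatures (and the `dst_path` parameter on prepare_batch) are a sketch, not the actual implementation:

```python
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Sketch of the minimal public interface proposed in subtask 4.

    Each modality (NWP, satellite, ...) implements these three methods,
    so batch data never needs to leave the DataSource.
    """

    @abstractmethod
    def get_available_t0_datetimes(self, history_minutes, forecast_minutes):
        """Return the t0 datetimes this source can serve."""

    @abstractmethod
    def sample_locations_for_datetimes(self, t0_datetimes):
        """Randomly sample an (x, y) center for each t0 datetime."""

    @abstractmethod
    def prepare_batch(self, t0_datetimes, center_x, center_y, dst_path):
        """Load a batch, select times and positions, write it to dst_path."""
```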
JackKelly commented 2 years ago

I'll start work on step 1 (pre-prepare a "plan") this afternoon :)

peterdudfield commented 2 years ago

I'll adjust

I'll think about adding

JackKelly commented 2 years ago

A more up-to-date, and more complete sketch of the design discussed in this issue is here: https://github.com/openclimatefix/nowcasting_dataset/issues/213#issuecomment-940153782