openclimatefix / nowcasting_dataset

Prepare batches of data for training machine learning solar electricity nowcasting models
https://nowcasting-dataset.readthedocs.io/en/stable/
MIT License

In `prepare_ml_data.py`, maybe don't allow `create_files_specifying_spatial_and_temporal_locations_of_each_example` to be run if a subset of `data_sources` is passed in at the command line #323

Open JackKelly opened 2 years ago

JackKelly commented 2 years ago

The issue is that, if the files specifying the spatial and temporal locations of each example are computed from fewer DataSources than are later used to create batches, then we're likely to attempt to sample from locations that don't exist in at least one DataSource.
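To make the failure mode concrete, here is a toy sketch. It is not code from `nowcasting_dataset`: the `available` dict, the timestep labels, and both helper functions are hypothetical, and real DataSources also have a spatial extent, which is ignored here. It only shows that locations computed from a subset of sources can include timesteps that other sources cannot serve.

```python
# Hypothetical availability of each data source, as sets of timestep labels.
# (Assumption for illustration only; real DataSources expose datetimes and locations.)
available = {
    "satellite": {"t0", "t1", "t2", "t3"},
    "nwp": {"t1", "t2", "t3"},
    "pv": {"t2", "t3", "t4"},
}


def valid_timesteps(source_names):
    """Return the timesteps present in every one of the named sources."""
    return set.intersection(*(available[name] for name in source_names))


def missing_in(sources_for_locations, sources_for_batches):
    """Map each batch source to the sampled timesteps it cannot serve."""
    locations = valid_timesteps(sources_for_locations)
    return {
        name: sorted(locations - available[name])
        for name in sources_for_batches
        if locations - available[name]
    }


all_sources = ["satellite", "nwp", "pv"]

# Locations computed from a subset, then batches made from every source:
subset_problem = missing_in(["satellite"], all_sources)

# Locations computed from every source: nothing is missing.
full_problem = missing_in(all_sources, all_sources)
```

With these made-up sets, `subset_problem` reports that `nwp` lacks `t0` and `pv` lacks `t0` and `t1`, while `full_problem` is empty.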

Maybe the mechanism should be:

By default, if `prepare_ml_data.py` is called with at least one `--data_source` command line argument, and if the files specifying locations don't exist, then throw an error. But allow users to override this behaviour with a `--force_creation_of_locations` flag, or something like that?
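A minimal sketch of that guard, assuming an `argparse`-style CLI. How `prepare_ml_data.py` actually parses its arguments and where it stores the locations files are not stated in this issue, so `build_parser`, `should_create_locations`, and the `locations_path` argument are all hypothetical; only the two flag names come from the proposal above.

```python
import argparse
from pathlib import Path


def build_parser():
    """Hypothetical CLI with the two flags discussed in this issue."""
    parser = argparse.ArgumentParser()
    # Repeatable flag: each use restricts processing to one more data source.
    parser.add_argument("--data_source", action="append", default=[])
    # Proposed escape hatch (name taken from the suggestion above).
    parser.add_argument("--force_creation_of_locations", action="store_true")
    return parser


def should_create_locations(args, locations_path):
    """Decide whether the locations files should be (re)computed.

    Returns False if they already exist, True if it is safe to create them,
    and raises if only a subset of data sources was requested without the
    force flag.
    """
    if Path(locations_path).exists():
        return False  # Reuse the existing locations.
    if args.data_source and not args.force_creation_of_locations:
        raise RuntimeError(
            "Locations files do not exist and only a subset of data sources "
            f"was requested ({args.data_source}). Re-run with all data "
            "sources, or pass --force_creation_of_locations."
        )
    return True
```

For example, `should_create_locations(build_parser().parse_args(["--data_source", "pv"]), path)` raises when `path` does not exist, and returns `True` once `--force_creation_of_locations` is added.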

JackKelly commented 2 years ago

And/or check against `start_time` and `end_time` (once those have been implemented in issue #425), as suggested by Peter in this comment: https://github.com/openclimatefix/nowcasting_dataset/issues/439#issuecomment-972728948