Tweak the way DataSources are represented in the Configuration model

openclimatefix / nowcasting_dataset

Prepare batches of data for training machine learning solar electricity nowcasting data

https://nowcasting-dataset.readthedocs.io/en/stable/

MIT License

24 stars 6 forks source link

Tweak the way DataSources are represented in the Configuration model #217

Closed JackKelly closed 2 years ago

JackKelly commented 2 years ago

Detailed Description

We maybe want to make a few small changes to the Configuration model:

include a list of enabled_data_sources
maybe split the config up so each data source has its own section, so that entire section can be passed into the DataSource object. e.g. there would be an NWP section, with fields zarr_path, image_size, and channels.

Context

Splitting up the Configuration so each data source has its own section should mean that we don't have to write custom code to translate between a Configuration object and the constructor for each DataSource. Instead we just do something like SatDataSource(**config.sat_data)

peterdudfield commented 2 years ago

perhaps the layout could be config:

data_sources -- satellite -- nwp -- e.t.c
other subfields like Git

JackKelly commented 2 years ago

Yeah, exactly! To flesh that out one more level deep:

data_sources:
- default_forecast_minutes
- default_history_minutes
- satellite
  - zarr_path
  - channels
  - forecast_minutes # Optionally override default_forecast_minutes for Satellite data
  - history_minutes # Optionally override default_history_minutes for Satellite data
  - image_size_pixels
- nwp
  - zarr_path
  - channels
  - forecast_minutes # Optionally override default_forecast_minutes for NWP data
  - history_minutes # Optionally override default_history_minutes for NWP data
  - image_size_pixels
- etc.
git
process
- seed
- batch_size
- local_temp_path
output_data
etc.

I'm not entirely sure about the default_forecast_minutes thing. Maybe it's conceptually simpler to just enter forecast_minutes and history_minutes separately for each data_source. The main aim is to allow us to have different lengths for different data sources (e.g. 4 hours of historical NWPs; and only 1 hour of historical satellite)

JackKelly commented 2 years ago

Also, it would be great if the field names in the config file exactly match the constructor arguments for each DataSource so, to create, say, a SatelliteDataSource, we can just do SatelliteDataSource(**config.data_sources.satellite)

Should we keep the field name zarr_path and change the DataSources to take a zarr_path argument? Or should we always use filename for all the DataSources? (Although PVDataSource requires two filenames: one for the PV timeseries data; the other for metadata.)

peterdudfield commented 2 years ago

Yea I like the lay out above, I'm not quite sure how to do the 'default_forecast_minutes' from a model higher up, but we could have a play.

In the idea of being verbose, I think keep it as zarr_path, and for PVDataSource, put both the filenames in>

We can add a validator to make sure the filename or zarr_path has the right suffix

JackKelly commented 2 years ago

I'm not quite sure how to do the 'default_forecast_minutes' from a model higher up

Yeah, I wonder if we should keep things simple and users just have to put the forecast_minutes into each DataSource?

BTW, are you happy to have a shot at this issue? Absolutely no worries either way - I'm happy to if you've got too much other stuff on this week!

peterdudfield commented 2 years ago

Yea - i'll give it a go

JackKelly commented 2 years ago

Cheers, thank you!

To give a bit more context... and don't necessarily worry about implementing this yet... but I guess the ultimate goal is that, in prepare_ml_data.py, we can have a single line like this:

data_sources = get_data_sources(config.data_sources)

And, get_data_sources() could look something like this:

def get_data_sources(config_for_all_data_sources) -> list[DataSource]:
    data_source_name_to_class = {'satellite': SatelliteDataSource, 'nwp': NWPDataSource, ...}
    data_sources = []
    for data_source_name, data_source_class in data_source_name_to_class.items():
        config_for_data_source = getattr(config_for_all_data_sources, data_source_name)
        data_source = data_source_class(**config_for_data_source)
        data_sources.append(data_source)
    return data_sources

peterdudfield commented 2 years ago

That would be very neat indeed