openclimatefix / nowcasting_dataset

Prepare batches of data for training machine learning solar electricity nowcasting data
https://nowcasting-dataset.readthedocs.io/en/stable/
MIT License
24 stars 6 forks source link

Tweak the way DataSources are represented in the Configuration model #217

Closed JackKelly closed 2 years ago

JackKelly commented 2 years ago

Detailed Description

We maybe want to make a few small changes to the Configuration model:

Context

Splitting up the Configuration so each data source has its own section should mean that we don't have to write custom code to translate between a Configuration object and the constructor for each DataSource. Instead we just do something like SatDataSource(**config.sat_data)

peterdudfield commented 2 years ago

perhaps the layout could be config:

JackKelly commented 2 years ago

Yeah, exactly! To flesh that out one more level deep:

I'm not entirely sure about the default_forecast_minutes thing. Maybe it's conceptually simpler to just enter forecast_minutes and history_minutes separately for each data_source. The main aim is to allow us to have different lengths for different data sources (e.g. 4 hours of historical NWPs; and only 1 hour of historical satellite)

JackKelly commented 2 years ago

Also, it would be great if the field names in the config file exactly match the constructor arguments for each DataSource so, to create, say, a SatelliteDataSource, we can just do SatelliteDataSource(**config.data_sources.satellite)

Should we keep the field name zarr_path and change the DataSources to take a zarr_path argument? Or should we always use filename for all the DataSources? (Although PVDataSource requires two filenames: one for the PV timeseries data; the other for metadata.)

peterdudfield commented 2 years ago

Yea I like the lay out above, I'm not quite sure how to do the 'default_forecast_minutes' from a model higher up, but we could have a play.

In the idea of being verbose, I think keep it as zarr_path, and for PVDataSource, put both the filenames in>

We can add a validator to make sure the filename or zarr_path has the right suffix

JackKelly commented 2 years ago

I'm not quite sure how to do the 'default_forecast_minutes' from a model higher up

Yeah, I wonder if we should keep things simple and users just have to put the forecast_minutes into each DataSource?

BTW, are you happy to have a shot at this issue? Absolutely no worries either way - I'm happy to if you've got too much other stuff on this week!

peterdudfield commented 2 years ago

Yea - i'll give it a go

JackKelly commented 2 years ago

Cheers, thank you!

To give a bit more context... and don't necessarily worry about implementing this yet... but I guess the ultimate goal is that, in prepare_ml_data.py, we can have a single line like this:

data_sources = get_data_sources(config.data_sources)

And, get_data_sources() could look something like this:

def get_data_sources(config_for_all_data_sources) -> list[DataSource]:
    data_source_name_to_class = {'satellite': SatelliteDataSource, 'nwp': NWPDataSource, ...}
    data_sources = []
    for data_source_name, data_source_class in data_source_name_to_class.items():
        config_for_data_source = getattr(config_for_all_data_sources, data_source_name)
        data_source = data_source_class(**config_for_data_source)
        data_sources.append(data_source)
    return data_sources
peterdudfield commented 2 years ago

That would be very neat indeed