Closed JackKelly closed 2 years ago
perhaps the layout could be config:
data_sources -- satellite -- nwp -- e.t.c
other subfields like Git
Yeah, exactly! To flesh that out one more level deep:
I'm not entirely sure about the default_forecast_minutes
thing. Maybe it's conceptually simpler to just enter forecast_minutes
and history_minutes
separately for each data_source. The main aim is to allow us to have different lengths for different data sources (e.g. 4 hours of historical NWPs; and only 1 hour of historical satellite)
Also, it would be great if the field names in the config file exactly match the constructor arguments for each DataSource
so, to create, say, a SatelliteDataSource
, we can just do SatelliteDataSource(**config.data_sources.satellite)
Should we keep the field name zarr_path
and change the DataSources
to take a zarr_path
argument? Or should we always use filename
for all the DataSources
? (Although PVDataSource
requires two filenames: one for the PV timeseries data; the other for metadata.)
Yea I like the lay out above, I'm not quite sure how to do the 'default_forecast_minutes' from a model higher up, but we could have a play.
In the idea of being verbose, I think keep it as zarr_path, and for PVDataSource, put both the filenames in>
We can add a validator to make sure the filename or zarr_path has the right suffix
I'm not quite sure how to do the 'default_forecast_minutes' from a model higher up
Yeah, I wonder if we should keep things simple and users just have to put the forecast_minutes
into each DataSource
?
BTW, are you happy to have a shot at this issue? Absolutely no worries either way - I'm happy to if you've got too much other stuff on this week!
Yea - i'll give it a go
Cheers, thank you!
To give a bit more context... and don't necessarily worry about implementing this yet... but I guess the ultimate goal is that, in prepare_ml_data.py
, we can have a single line like this:
data_sources = get_data_sources(config.data_sources)
And, get_data_sources()
could look something like this:
def get_data_sources(config_for_all_data_sources) -> list[DataSource]:
data_source_name_to_class = {'satellite': SatelliteDataSource, 'nwp': NWPDataSource, ...}
data_sources = []
for data_source_name, data_source_class in data_source_name_to_class.items():
config_for_data_source = getattr(config_for_all_data_sources, data_source_name)
data_source = data_source_class(**config_for_data_source)
data_sources.append(data_source)
return data_sources
That would be very neat indeed
Detailed Description
We maybe want to make a few small changes to the Configuration model:
enabled_data_sources
NWP
section, with fields zarr_path, image_size, and channels.Context
Splitting up the Configuration so each data source has its own section should mean that we don't have to write custom code to translate between a Configuration object and the constructor for each DataSource. Instead we just do something like
SatDataSource(**config.sat_data)