Add `smif prepare` command to cli to convert data to use binary filestore

tomalrussell commented 5 years ago

smif prepare --from=csv --to=binary <model_run> imports scenario and parameter data provided using the csv data store and writes out using the binary data store

This will allow 'pure' implementations of the DataStore classes to retrieve data for a model run.

tomalrussell commented 5 years ago

From wiki: https://github.com/nismod/smif/wiki/controller

tlestang commented 5 years ago

smif prepare-convert model_run_name converts model run data from csv to Parquet or parquet to csv depending on the local interface. For now only tested in the former case.

model run data is composed of:

strategies interventions data
scenario data - for all variants
narrative data - for all variants
initial conditions, default parameters and interventions data

In most of cases, a file data_filename.parquet is generated alongside the original data_filename.csv, except for initial conditions and interventions data files (see below)

store.read_interventions() reads all intervention files listed in the sector model config file and aggregate them into a single dictionary. Therefore store.write_interventions() only writes 1 single file instead of the two original data files.
store.write_interventions() generates a file named after the model name and appends the .csv extension, i.e. model_name.csv. The .csv extension is replaced by the correct extension in the data store method. So if model energy_demand references 2 intervention data files energy_supply.csv and energy_supply_alt.csv then one gets 1 file energy_demand.parquet
same goes for store.write_initial_conditions() that names the data file after the sector model name. For instance reservoirs.csv is converted in energy_demand.parquet and water_supply_oxford.csv is converted in water_supply.parquet

So after using the prepare-convert command, one must modify the config files accordingly, which I think is not desirable. It may not be too difficult to name the parquet files after the original files, but probably involves messing with the file store and break things ?

tlestang commented 5 years ago

Just realized that instead of converting all variants for scenarios and narratives, it is probably more consistent to convert only those used in the prescribed model run.

tomalrussell commented 5 years ago

I think we should avoid merging the multiple interventions files, multiple initial conditions.

Use-cases for multiple files:

set of columns may vary between kinds of interventions (though schemas are not explicitly defined)
sets of initial conditions may be mixed-and-matched between model runs

Possible approach:

change ConfigStore to use string ids for datasets instead of filepaths
add an index (YAML file/sqlite with single table) to the base of a data store to provide lookup (string id => relative filepath)
add some lower-level methods on Store/DataStore to read/write separate interventions/initial conditions files using string ids

tlestang commented 5 years ago

I don't really understand what you mean by

change ConfigStore to use string ids for datasets instead of filepaths

Could you give an example ?

For now I opted for a very naive approach, implementing methods on Store/DataStore to read/write interventions for a specific file. The filename is still provided by the ConfigStore

model = store.read_model('energy_demand')
for intervention_file in model['interventions']:
    interventions = csv_store.read_interventions_file(intervention_file)
    parquet_store.write_interventions_file(intervention_file, interventions)

with for instance in the DataStore

    def read_interventions_file(self, key):
        path = os.path.join(self.data_folders['interventions'], key)
        return self._read_list_of_dicts(path)

    def write_interventions_file(self, key, interventions):
        path = os.path.join(self.data_folders['interventions'], key)
        path = self._set_file_extension(path)
        self._write_list_of_dicts(path, interventions)

tomalrussell commented 5 years ago

This looks like a simpler approach to get started - but do we still need to edit the config files eventually?

e.g. start with a model.yml something like:

name: energy_model
interventions:
  - energy_supply.csv
  - energy_supply_alt.csv

then run prepare-convert, which picks up the two csv files and outputs two parquet files, energy_supply.parquet and energy_supply_alt.parquet

then run the model using the parquet data store; the path to interventions files are still *.csv, will it fail to read them? automagically try a different extension?

By "change ConfigStore to use string ids for datasets instead of filepaths" I meant adding a layer of indirection, so the config yamls don't know directly about any file paths for the data store.

change the model.yml to something like:

name: energy_model
interventions:
  - energy_supply
  - energy_supply_alt

add a csv_store_files.csv at the base of the data folder:

key,path
energy_supply,energy_supply.csv
energy_supply_alt,energy_supply_alt.csv

then when writing/converting to parquet, add a parquet_store_files.csv (or .parquet?!):

key,path
energy_supply,energy_supply.parquet
energy_supply_alt,energy_supply_alt.parquet

this means a breaking change to current configs, but would decouple the configs from the data store implementation

tomalrussell commented 5 years ago

Closed by #383

nismod / smif

Add `smif prepare` command to cli to convert data to use binary filestore #319