imperialCHEPI / healthgps

Global Health Policy Simulation model (Health-GPS)
https://imperialchepi.github.io/healthgps/
BSD 3-Clause "New" or "Revised" License
5 stars 1 forks source link

Remove unnecessary fields in `inputs`/`dataset` section of config files #485

Open alexdewar opened 1 month ago

alexdewar commented 1 month ago

This section currently looks something like this:

"inputs": {
    "dataset": {
        "name": "India.DataFile.csv",
        "format": "csv",
        "delimiter": ",",
        "encoding": "ASCII",
        "columns": {
            "Gender": "integer",
            "Age": "integer",
            "Sector": "integer",
            "Income": "integer",
            "Carbohydrate": "double",
            "Protein": "double",
            "Sodium": "double",
            "Fat": "double",
            "PhysicalActivity": "double",
            "EnergyIntake": "double",
            "Weight": "double",
            "Height": "double"
        }
    }

This seems a bit bloated and I'm wondering if we can cut it down (which would simplify #449).

Some thoughts after going through the examples repo:

Any thoughts @jamesturner246?

alexdewar commented 1 month ago

See also: https://github.com/imperialCHEPI/healthgps-examples/pull/12

alexdewar commented 1 month ago

Just to add, one alternative to removing these fields would be to make some/al of them constants for now. We could enforce this with a schema. If we add support for changing them to HGPS at some point in the future, we can update the schema accordingly.

That said, I'd still be inclined to remove delimiter and encoding, because they seem pointless.

jamesturner246 commented 1 month ago

Things like format and encoding can go.

If we remove `delimiter, I wonder if the CSV lib we use is smart enough to figure out European decimal points? Had all sorts of trouble when Ali was feeding me mixed period/comma config CSVs.

I'm a bit hesitant about inferring column type from the files, as I've seen mixed 1.0s in integer columns and 1s in float columns. If we start passing in floats to code that expects ints, I wonder if things start breaking. But on the other hand I agree it does look bloated.

I think as long as it doesn't cause problems, go for it.

alexdewar commented 3 weeks ago

If we remove `delimiter, I wonder if the CSV lib we use is smart enough to figure out European decimal points? Had all sorts of trouble when Ali was feeding me mixed period/comma config CSVs.

I think the answer there is to only allow commas as delimiters and dots as decimal points. If we open the door to users providing the data in weird formats, then we make everyone's life harder.

I'm a bit hesitant about inferring column type from the files, as I've seen mixed 1.0s in integer columns and 1s in float columns. If we start passing in floats to code that expects ints, I wonder if things start breaking. But on the other hand I agree it does look bloated.

I was assuming that these CSV files always have the same columns -- is that not the case? If not, then it does make sense to leave things as is. I was hoping to be able to use table schemas to validate CSV files (which similarly indicate the data type of each column), but obviously that will only work if we're expecting CSV files to be in a specific format.

jamesturner246 commented 3 weeks ago

Yeah, columns can be added, for instance. Come to think of it, I think these India.Datafile.csv and France.Datafile.csv are only ever used to print the initial status table at the beginning, and are not actually used from then on. This caused confusion and extra work in the past, where they were looking for data for this, but it wasn't actually used beyond the initial status message.

alexdewar commented 3 weeks ago

That's a bit strange...

So, to be clear, are you saying that there are:

  1. Some columns which must be present and whose type we know in advance
  2. And possibly extra columns, which aren't used anywhere (and therefore whose type doesn't matter)?

If so, we could define table schemas which require certain columns to be present, but allowing for arbitrary extra columns, if the user wants.

jamesturner246 commented 3 weeks ago

I mean that the whole file is used only in displaying the initial stat table, I think. Never used elsewhere, 97% sure.