hubverse-org / hubValidations

Testing framework for hubverse hub validations
https://hubverse-org.github.io/hubValidations/

validate data types in submission files #12

Closed elray1 closed 1 year ago

elray1 commented 1 year ago

these should be consistent with the hub's schema, e.g. any task ids that are set up to be integers by the hub's tasks.json config should be encoded as an integer data type in the parquet file.
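
For parquet submissions this can be checked directly against the file's embedded schema. A minimal sketch, assuming arrow is available; the `expected_types` lookup and the file path are hypothetical stand-ins for whatever gets derived from tasks.json:

```r
library(arrow)

# Hypothetical expected column types derived from the hub's tasks.json config
expected_types <- c(horizon = "int32", location = "string", value = "double")

# Read the submission as an Arrow Table (types preserved) and inspect its schema
file_schema <- arrow::read_parquet(
  "model-output/team-model/2023-05-01-team-model.parquet",
  as_data_frame = FALSE
)$schema

# Compare the declared type of each configured column against expectations
actual_types <- vapply(
  names(expected_types),
  function(col) file_schema$GetFieldByName(col)$type$ToString(),
  character(1)
)
data.frame(
  column = names(expected_types),
  expected = expected_types,
  actual = actual_types
)[actual_types != expected_types, ]  # columns to flag
```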

annakrystalli commented 1 year ago

Yes! Validating this will be handled by the check function check_tbl_col_types() (WIP).

However, it's the csv format that is a bit more problematic, as it does not contain type metadata and default read behaviour can deviate from what the schema expects without the file being wrong.

Say, for example, the value column should be double but a given file happens to contain only integers in value. When the file is read in, value would be parsed as integer by default and would fail the data type check, even though there is nothing wrong with the file.

We could try to coerce to the correct type and, if that succeeds, call the file valid (as open_dataset() would be able to convert it to the appropriate data type when accessing the data). However, that means that numeric values stored as characters could also pass a coercion test when we would clearly want them to fail, as we'd be asking for trouble by allowing the potential for mixing characters in some files and numerics in others for the same column.
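
To make that concrete, here's a minimal illustration of both problems, assuming arrow's default csv type inference (the temporary file and values are just for the example):

```r
library(arrow)

# A csv whose value column happens to contain only whole numbers
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "location,output_type,output_type_id,value",
  "US,quantile,0.5,1000"
), tmp)

# Default type inference parses value as an integer type rather than double,
# so a strict comparison against the hub schema would report a failure
# even though the file itself is fine.
tbl <- arrow::read_csv_arrow(tmp, as_data_frame = FALSE)
tbl$schema$GetFieldByName("value")$type$ToString()

# Conversely, a plain coercion test is too permissive: character data that
# merely looks numeric coerces silently, so it would also "pass".
as.double(c("1000", "2000"))
#> [1] 1000 2000
```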

Currently, I've implemented an argument use_hub_schema in the read_model_out_file() fn to allow us to potentially address this at read time: https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations/blob/697f75a8e2186a134fc8ac3ab903bd9b9471c742/R/read_model_out_file.R#L25C5-L37C7. But the more I think about it, this would check data types at the wrong stage (not in its dedicated check_tbl_data_types() fn) and is just as prone to erroneous validation from the successful character -> numeric coercion just mentioned.

Currently the only option I see is to check only the special case of integer to double and treat all other data type mismatches as validation failures, but I'm open to better suggestions!
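
One way to express that special case (the helper name and type strings here are just illustrative, not the eventual check_tbl_col_types() implementation):

```r
# Hypothetical comparison: a mismatch is acceptable only when the file
# provides an integer column where the schema expects double, since
# integer -> double promotion is lossless when the data are accessed.
compatible_type <- function(expected, actual) {
  expected == actual ||
    (expected == "double" && actual %in% c("int32", "int64"))
}

compatible_type("double", "int64")     # TRUE  -- tolerated special case
compatible_type("double", "string")    # FALSE -- validation failure
compatible_type("character", "int64")  # FALSE -- validation failure
```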

elray1 commented 1 year ago

Mostly this makes sense to me -- just a brief comment re: this piece: "However, that means that numeric values stored as characters could also pass a coercion test when we would clearly want them to fail, as we'd be asking for trouble by allowing the potential for mixing characters in some files and numerics in others for the same column."

My understanding is that in csv files, "0.1" is a valid way to encode the numeric value 0.1. Or, maybe more precisely put, it is my understanding that csv doesn't encode data types and allows fields to be enclosed in quotes, so the following two csv files contain exactly the same information:

```
location,output_type,output_type_id,value
US,quantile,0.5,1000
```

```
"location","output_type","output_type_id","value"
"US","quantile","0.5","1000"
```

So, for csv formatted data, I do think this "coercion test" is a reasonable way to go. I think it's only in parquet files that the data type is actually encoded, so it's only for those files that we can do these type checks more directly.
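
For reference, one way that csv "coercion test" could be expressed is to supply the expected schema at read time, so the check simply becomes whether parsing succeeds. A sketch with arrow; the schema object and file path are placeholders, not the actual hub config:

```r
library(arrow)

# Placeholder arrow schema standing in for one derived from tasks.json
hub_schema <- arrow::schema(
  location = arrow::string(),
  output_type = arrow::string(),
  output_type_id = arrow::string(),
  value = arrow::float64()
)

# Supplying the schema makes arrow parse each column to the expected type;
# values that can't be represented (e.g. "abc" in value) raise a conversion
# error, which is the behaviour we'd want the csv check to surface.
tbl <- arrow::read_csv_arrow(
  "model-output/team-model/2023-05-01-team-model.csv",
  schema = hub_schema,
  skip = 1  # skip the file's own header row since the schema supplies names
)
```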