Closed annakrystalli closed 1 year ago
@LucieContamin & @elray1 , can you have a quick look at the manual checks listed above and confirm I've not missed or got something wrong?
I have a question about the "can be coerced to type" check. For type integer, we can coerce a float to integer in R without any issue, but we might want the validation to still return an error. Will the validation take that behavior into account?
I can't think of anything else
So the reason we need to check this is that there is always the possibility that values for multiple output types are stacked in the single `value` column and coerced to the lowest common denominator data type (recall similar discussions we had about the `output_type_id` column). So if we have two output types in a file, one which accepts integers and the other doubles (number in JSON), in the `value` column they would all be stored as doubles, but that shouldn't trigger a failure, as there is no other way to mix those two data types in a single column.
The actual check splits the submission file into separate output types (and in fact modeling tasks too, to match data exactly to the correct config section) and checks that each chunk can be coerced to the data type specified in the config for a given output type. It's really the only way to test against the `value` elements in the config. What I mainly envisage this check catching, though, is stray characters etc.
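A minimal sketch of the split-and-coerce idea described above (hypothetical code, not the actual hubValidations implementation; the `expected_types` lookup stands in for whatever the `tasks` config declares):

```r
# Values arrive in a single column, stored as the lowest common
# denominator type (here character, to make the coercion explicit).
tbl <- data.frame(
  output_type = c("mean", "mean", "quantile", "quantile"),
  value = c("4", "7", "5.5", "6.5"),
  stringsAsFactors = FALSE
)

# Type expected for each output type, as it might appear in the config.
expected_types <- c(mean = "integer", quantile = "double")

# Split by output type and check each chunk can be coerced to its type.
results <- vapply(split(tbl, tbl$output_type), function(chunk) {
  type <- expected_types[[chunk$output_type[1]]]
  coerced <- suppressWarnings(
    switch(type,
      integer = as.integer(chunk$value),
      double = as.numeric(chunk$value)
    )
  )
  # An NA introduced by coercion (e.g. a stray character) fails the check.
  !anyNA(coerced)
}, logical(1))

results
```

Here both chunks pass; replacing a value with, say, `"4x"` would introduce an `NA` on coercion and fail that chunk.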
Does that make more sense?
Ah yes, sorry, my question was not clear.
I understand why we need it; the issue I have is if, for example, we have a value of 3.1 where it's supposed to be an integer. In R, you can coerce it to integer with as.integer() and it will return 3 without any message, warning or error.
So, my question was: how do we deal with that?
Does my question make more sense?
Yes! That makes total sense. Sorry for the confusion. And yes you are right, that should totally fail.
In my view it should only fail in the case where what should be an integer contains information in the decimal part of the value, which is therefore lost when coerced to integer. If coercion to integer yields the same effective value (i.e. 3.0 == 3L), that should be expected behaviour and a pass. Will implement.
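The lossless-coercion rule could be sketched like this (hypothetical helper, not the actual hubValidations code): a value passes the integer check only if coercing it to integer preserves its effective value.

```r
# A value is a valid "integer" only if coercion to integer is lossless.
is_valid_integer <- function(x) {
  coerced <- suppressWarnings(as.integer(x))
  # Fail on coercion failure (NA) or on lost decimal information.
  !is.na(coerced) & coerced == x
}

is_valid_integer(c(3, 3.0, 3.1))  # TRUE TRUE FALSE
```

So 3 and 3.0 pass (3.0 == 3L) while 3.1 fails, because as.integer(3.1) silently truncates to 3.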
Fixed in e3a65dea9a72e534ba4da764eaaf4277fa916cca.
Also helped detect an error in the hubUtils flusight example hub files! e.g. https://github.com/Infectious-Disease-Modeling-Hubs/hubUtils/blob/main/inst/testhubs/flusight/forecasts/hub-ensemble/2023-04-24-hub-ensemble.csv#L1-L4
A number of checks are required, some according to output type.

`check_tbl_value_col`: some of the variability in checks is captured by the limitations of the schema, so a check against the `tasks` config file will suffice.

Additional manual checks required:

- CDF/quantile (`check_tbl_value_col_ascending`): for each individual prediction, entries in `value` must be non-decreasing as `output_type_id` increases.
- pmf (`check_tbl_value_col_sum1`): values must add up to 1.
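The two output-type-specific checks above could be sketched as follows (illustrative helpers only, not the actual hubValidations implementation; each operates on the values of a single prediction):

```r
# CDF/quantile: values must be non-decreasing as output_type_id
# (the quantile level or CDF point) increases.
check_value_ascending <- function(values, output_type_id) {
  ordered <- values[order(as.numeric(output_type_id))]
  all(diff(ordered) >= 0)
}

# pmf: values for a single prediction must sum to 1
# (with a tolerance for floating point error).
check_value_sum1 <- function(values, tol = 1e-8) {
  abs(sum(values) - 1) < tol
}

check_value_ascending(c(1, 4, 9), c("0.25", "0.5", "0.75"))  # TRUE
check_value_sum1(c(0.2, 0.3, 0.5))                           # TRUE
```

Note the ascending check sorts by `output_type_id` first, since row order in a submission file is not guaranteed to follow quantile level.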