hubverse-org / hubValidations

Testing framework for hubverse hub validations
https://hubverse-org.github.io/hubValidations/
Other
1 stars 4 forks source link

Create `check_tbl_values_col` function #19

Closed annakrystalli closed 1 year ago

annakrystalli commented 1 year ago

A number of checks required, some according to output type.

Additional manual checks required

CDF/quantile

pmf

annakrystalli commented 1 year ago

@LucieContamin & @elray1 , can you have a quick look at the manual checks listed above and confirm I've not missed or got sth wrong?

LucieContamin commented 1 year ago

I have a question about the "can be coerced to type". For type integer, we can coerced float to integer in R without any issue but we might want the validation to still return an error. Will the validation will take that behavior into account?

elray1 commented 1 year ago

I can't think of anything else

annakrystalli commented 1 year ago

I have a question about the "can be coerced to type". For type integer, we can coerced float to integer in R without any issue but we might want the validation to still return an error. Will the validation will take that behavior into account?

So the reason we need to check this is because there is always the possibility that values for multiple output types are stacked in the single value column and coerced to the lowest common denominator data type (recall similar discussions we had about output_type_id column). So if we have two output types in a file, one which accepts integers and the other double (number in json), in the value column they would all be stored as double, but that shouldn't trigger a failure as there is no other way to mix those two data types in a single column.

The actual check splits the submission file into separate output types (and in fact modeling tasks too to match data to exactly to the correct config section) and checks that each chunk can be coerced to the data type specified in the config for a given output type. It's the only way really to test against the value elements in the config. What I mainly envisage this check catching though is stray characters etc.

Does that make more sense?

LucieContamin commented 1 year ago

Ah yes sorry, my question was not clear. I understand why we need it, the issue I have is if for example we have a value that is 3.1 where it's suppose to be integer. In R, you can coerce it to integer with as.integer() for example and it will returns 3 without message, warning or error message.

So, my question was how do we deal with that?

Does my question make more sense?

annakrystalli commented 1 year ago

Yes! That makes total sense. Sorry for the confusion. And yes you are right, that should totally fail.

In my view it should only fail in the case where what should be an integer contains info in the decimal part of the value which in therefore lost when coerced to integer. If coercion to integer yields the same effective value (i.e. 3.0 == 3L) that should be expected behaviour and a pass. Will implement.

annakrystalli commented 1 year ago

Fixed in e3a65dea9a72e534ba4da764eaaf4277fa916cca.

Also helped detect error in hubUtils flusight example hub files! e.g https://github.com/Infectious-Disease-Modeling-Hubs/hubUtils/blob/main/inst/testhubs/flusight/forecasts/hub-ensemble/2023-04-24-hub-ensemble.csv#L1-L4