Closed annakrystalli closed 1 year ago
@LucieContamin & @elray1 , can you have a quick look at the manual checks listed above and confirm I've not missed or got something wrong?
I have a question about the "can be coerced to type" check. For type integer, we can coerce a float to integer in R without any issue, but we might want the validation to still return an error. Will the validation take that behavior into account?
I can't think of anything else
So the reason we need to check this is that there is always the possibility that values for multiple output types are stacked in the single `value` column and coerced to the lowest common denominator data type (recall similar discussions we had about the `output_type_id` column). So if we have two output types in a file, one which accepts integers and the other doubles (number in JSON), in the `value` column they would all be stored as doubles, but that shouldn't trigger a failure, as there is no other way to mix those two data types in a single column.
The actual check splits the submission file into separate output types (and in fact modeling tasks too, to match data exactly to the correct config section) and checks that each chunk can be coerced to the data type specified in the config for a given output type. It's really the only way to test against the `value` elements in the config. What I mainly envisage this check catching, though, is stray characters etc.
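A minimal sketch of the split-and-coerce idea described above (hypothetical code, not the actual hubValidations implementation; the `expected_types` lookup stands in for whatever the `tasks` config declares):

```r
# Values arrive in a single column, stored as the lowest common
# denominator type (here character, to make the coercion explicit).
tbl <- data.frame(
  output_type = c("mean", "mean", "quantile", "quantile"),
  value = c("4", "7", "5.5", "6.5"),
  stringsAsFactors = FALSE
)

# Type expected for each output type, as it might appear in the config.
expected_types <- c(mean = "integer", quantile = "double")

# Split by output type and check each chunk can be coerced to its type.
results <- vapply(split(tbl, tbl$output_type), function(chunk) {
  type <- expected_types[[chunk$output_type[1]]]
  coerced <- suppressWarnings(
    switch(type,
      integer = as.integer(chunk$value),
      double = as.numeric(chunk$value)
    )
  )
  # An NA introduced by coercion (e.g. a stray character) fails the check.
  !anyNA(coerced)
}, logical(1))

results
```

Here both chunks pass; replacing a value with, say, `"4x"` would introduce an `NA` on coercion and fail that chunk.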
Does that make more sense?
Ah yes, sorry, my question was not clear.
I understand why we need it; the issue I have is if, for example, we have a value of 3.1 where it's supposed to be an integer. In R, you can coerce it to integer with as.integer() and it will return 3 without any message, warning or error.
So, my question was: how do we deal with that?
Does my question make more sense?
Yes! That makes total sense. Sorry for the confusion. And yes you are right, that should totally fail.
In my view it should only fail in the case where what should be an integer contains information in the decimal part of the value, which is therefore lost when coerced to integer. If coercion to integer yields the same effective value (i.e. 3.0 == 3L), that should be expected behaviour and a pass. Will implement.
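The lossless-coercion rule could be sketched like this (hypothetical helper, not the actual hubValidations code): a value passes the integer check only if coercing it to integer preserves its effective value.

```r
# A value is a valid "integer" only if coercion to integer is lossless.
is_valid_integer <- function(x) {
  coerced <- suppressWarnings(as.integer(x))
  # Fail on coercion failure (NA) or on lost decimal information.
  !is.na(coerced) & coerced == x
}

is_valid_integer(c(3, 3.0, 3.1))  # TRUE TRUE FALSE
```

So 3 and 3.0 pass (3.0 == 3L) while 3.1 fails, because as.integer(3.1) silently truncates to 3.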
Fixed in e3a65dea9a72e534ba4da764eaaf4277fa916cca.
Also helped detect an error in the hubUtils flusight example hub files! e.g. https://github.com/Infectious-Disease-Modeling-Hubs/hubUtils/blob/main/inst/testhubs/flusight/forecasts/hub-ensemble/2023-04-24-hub-ensemble.csv#L1-L4
A number of checks are required, some according to output type.

`check_tbl_value_col`: some of the variability in checks is captured by the limitations of the schema, so a check against the `tasks` config file will suffice.

Additional manual checks required:

- CDF/quantile (`check_tbl_value_col_ascending`): for each individual prediction, entries in `value` must be non-decreasing as `output_type_id` increases.
- pmf (`check_tbl_value_col_sum1`): values must add up to 1.
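The two output-type-specific checks above could be sketched as follows (illustrative helpers only, not the actual hubValidations implementation; each operates on the values of a single prediction):

```r
# CDF/quantile: values must be non-decreasing as output_type_id
# (the quantile level or CDF point) increases.
check_value_ascending <- function(values, output_type_id) {
  ordered <- values[order(as.numeric(output_type_id))]
  all(diff(ordered) >= 0)
}

# pmf: values for a single prediction must sum to 1
# (with a tolerance for floating point error).
check_value_sum1 <- function(values, tol = 1e-8) {
  abs(sum(values) - 1) < tol
}

check_value_ascending(c(1, 4, 9), c("0.25", "0.5", "0.75"))  # TRUE
check_value_sum1(c(0.2, 0.3, 0.5))                           # TRUE
```

Note the ascending check sorts by `output_type_id` first, since row order in a submission file is not guaranteed to follow quantile level.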