hubverse-org / hubValidations

Testing framework for hubverse hub validations
https://hubverse-org.github.io/hubValidations/

Unexpected error about `output_type_id` data type with FluSight submission #54

Closed: elray1 closed this issue 7 months ago

elray1 commented 9 months ago

Validating the attached file yields the following error:

! 2023-10-14-UMass-gbq_bootstrap.csv: Column data types do not match hub schema. `output_type_id` should be "character" not "double"

I don't think this error makes sense, since that column could have been read in with a character type; the data type issue here is more an issue with how the file was read in than with how the data was encoded.

annakrystalli commented 9 months ago

Given that the overall hub's `output_type_id` will be character, I would personally recommend staying on the safe side with the arrow package and encoding `output_type_id` as character in all files, as a way to avoid potential unexpected problems down the road with accessing data through queries.

Having said that, I see what you mean... even if the column were encoded as character, by default it is read back in as double... hmmm
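For example, a minimal round trip with readr (made-up values; a sketch, not the validation code):

```r
library(readr)

# A character column whose values merely look numeric
df <- data.frame(output_type_id = c("0.100", "0.250"))
f <- tempfile(fileext = ".csv")
write_csv(df, f)  # readr only quotes fields that need it, so none are quoted here

readLines(f)
#> [1] "output_type_id" "0.100"          "0.250"

# On re-read, the type guesser sees numbers and parses the column as double
read_csv(f, show_col_types = FALSE)$output_type_id
#> [1] 0.10 0.25
```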

elray1 commented 9 months ago

Essentially, CSV files don't have a way to encode data types. You can wrap field values in quotes, but that's part of the specification for delimiting fields, not a way to say whether a value is character or numeric, and any field may be quoted. So both

...,0.100,...

and

...,"0.100",...

are valid representations of both the numeric value 0.1 and the string value "0.100", and there is no way to tell from the file contents alone (unless some other value in the same column and a different row is clearly a string, in which case R will infer that the column as a whole must be a string data type).
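To see this in practice (a quick illustration with readr; the file contents are made up):

```r
library(readr)

f <- tempfile(fileext = ".csv")
writeLines(c("a,b", '0.100,"0.100"', '0.250,"0.250"'), f)

# Quoted or unquoted, both columns are guessed as double
sapply(read_csv(f, show_col_types = FALSE), class)
#>         a         b
#> "numeric" "numeric"

# The original strings survive only with an explicit column specification
read_csv(f, col_types = cols(b = col_character()))$b
#> [1] "0.100" "0.250"
```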

annakrystalli commented 9 months ago

Right. So while this check should work for parquet and arrow files, I've been struggling to think what this should entail for CSV files.

I guess we could try coercing the tbl to the schema and if any problems arise, the file essentially fails this check? It does imply that we are checking slightly different things for different file types (and hence will return different error messages for each).
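Something along these lines, perhaps (a rough sketch; the column names and types are illustrative stand-ins, not the actual hub schema handling):

```r
# Coerce each column to the type the hub schema expects; treat errors or
# newly introduced NAs as a failed check
schema_types <- list(
  horizon        = as.integer,
  output_type_id = as.character,
  value          = as.double
)

check_coercion <- function(tbl, schema_types) {
  tryCatch({
    for (col in names(schema_types)) {
      coerced <- schema_types[[col]](tbl[[col]])
      # e.g. as.integer("horizon 1") yields NA with a coercion warning
      if (anyNA(coerced) && !anyNA(tbl[[col]])) {
        stop("cannot coerce column '", col, "' to the hub schema type")
      }
    }
    TRUE
  }, warning = function(w) FALSE, error = function(e) FALSE)
}
```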

In any case, given it's turned up already in a submission (see Rebecca's email), I guess we have to do something different, even if only temporarily until a more robust approach is decided?

elray1 commented 9 months ago

Yeah. One note is that, if possible, I think we want to specify the data types at the time the CSV is read in rather than coercing after reading. For instance, in this case we would like to check that the output_type_ids are strings like "0.010", but if we first read CSVs with conversion to numeric and then try to convert back to string, we may end up with "0.01", which would then not match the required/optional value possibilities, which are formatted with three digits.
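To make the loss concrete:

```r
# Once "0.010" has been parsed as the double 0.01, the original formatting
# is gone and can't be recovered by converting back to character
as.character(as.double("0.010"))
#> [1] "0.01"
```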

Might there be a way to do something where we just try to read the files specifying a schema with expected data types, and see if that succeeds or throws errors?
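For example, something like this sketch with arrow's CSV reader (the schema fields here are assumptions based on the FluSight column layout, not the hub's actual config):

```r
library(arrow)

# Assumed hub schema, for illustration only
hub_schema <- schema(
  forecast_date   = date32(),
  target_end_date = date32(),
  horizon         = int32(),
  target          = string(),
  location        = string(),
  output_type     = string(),
  output_type_id  = string(),
  value           = float64()
)

# Try to read with the expected types; an arrow conversion error means the
# file contents are incompatible with the schema
result <- tryCatch(
  read_csv_arrow("2023-10-14-UMass-gbq_bootstrap.csv", col_types = hub_schema),
  error = function(e) e
)
inherits(result, "error")  # TRUE if the read, and hence the check, failed
```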

annakrystalli commented 9 months ago

So I've actually included such an option in the read_model_out_file function for CSVs. The only issue is that, if for some reason the coercion doesn't work on read, it's the check that the file can be read that will fail instead. Not sure whether that's a real problem or not, but it's worth being aware of.

elray1 commented 9 months ago

Hmmm.

Maybe a temporary solution could be to just go with using that option and the coarser check for now?

And then I don't know exactly, but:

annakrystalli commented 8 months ago

Fixed this in #55 for now.

Note that if there is a major problem with the data that does not allow the schema to be applied correctly in CSVs, the check_file_read step fails with a semi-informative arrow error.

E.g. the following file, with a glaring error in the first element of the horizon column:

#> # A tibble: 10 × 8
#>    forecast_date target_end_date horizon   target           location output_type
#>    <date>        <date>          <chr>     <chr>            <chr>    <chr>      
#>  1 2023-05-01    2023-05-08      horizon 1 wk ahead inc fl… US       mean       
#>  2 2023-05-01    2023-05-15      2         wk ahead inc fl… US       mean       
#>  3 2023-05-01    2023-05-08      1         wk ahead inc fl… US       quantile   
#>  4 2023-05-01    2023-05-08      1         wk ahead inc fl… US       quantile   
#>  5 2023-05-01    2023-05-08      1         wk ahead inc fl… US       quantile   
#>  6 2023-05-01    2023-05-08      1         wk ahead inc fl… US       quantile   
#>  7 2023-05-01    2023-05-08      1         wk ahead inc fl… US       quantile   
#>  8 2023-05-01    2023-05-08      1         wk ahead inc fl… US       quantile   
#>  9 2023-05-01    2023-05-08      1         wk ahead inc fl… US       quantile   
#> 10 2023-05-01    2023-05-08      1         wk ahead inc fl… US       quantile   
#> # ℹ 2 more variables: output_type_id <chr>, value <dbl>

Created on 2023-10-11 with reprex v2.0.2. Reading it fails with:

Error:
! Invalid: In CSV column #2: CSV conversion error to int32: invalid value 'horizon 1'

whit1951 commented 8 months ago

Hi Anna, I reinstalled the latest version, and while the original errors have been resolved, I'm receiving a new/different error (screenshot below). It seems like output_type_id is still being read in as a numeric class, so any quantile that doesn't already have three non-zero decimals (e.g., 0.975) is showing up as a mismatch for this field.

[screenshot]

Again, I would imagine this is potentially a special case for those submitting quantiles only. I am reasonably certain that I am formatting the file correctly for writing/saving. [screenshot]

Thanks, Lauren

annakrystalli commented 8 months ago

Interesting. That looks like it should indeed be passing, given the valid values in the tasks config. Will investigate. Thanks for reporting!