elray1 closed this issue 7 months ago
Given that the overall hub's output_type_id will be character, I would personally recommend staying on the safe side with the arrow package and encoding output_type_id as character in all files, as a way to avoid potential unexpected problems down the road when accessing data through queries.
Having said that, I see what you mean... even if the column is encoded as character, by default it is read in as double... hmmm
Essentially, CSV files don't have a way to encode data types. You can wrap field values in quotes, but that's part of the specification for delimiting fields, not a way to say whether a field is character or numeric -- and any field may be quoted. So both
...,0.100,...
and
...,"0.100",...
are valid representations of both the numeric value 0.1 and the string value "0.100", and there is no way to tell from the file contents alone (unless some other value in the same column and a different row is clearly a string, in which case R will figure that the column as a whole must be a string data type).
Right. So while this check should work for parquet and arrow files, I've been struggling to work out what it should entail for CSV files. I guess we could try coercing the tbl to the schema and, if any problems arise, the file essentially fails this check? It does imply that we are checking slightly different things for different file types (and hence will return different error messages for each).
In any case, given it's turned up already in a submission (see Rebecca's email), I guess we have to do something different, even if only temporarily until a more robust approach is decided?
Yeah. One note is that, if possible, we likely want to specify the data types at the time the CSV is read in, rather than coercing it after reading. For instance, in this case we would like to check that the output_type_ids are strings like "0.010", but if we first read CSVs with conversion to numeric and then try to convert back to string, we may end up with "0.01", which would then not match the required/optional value possibilities, which are formatted with 3 digits.
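The round-trip loss is easy to see in any language; a minimal Python illustration (plain float formatting, not the hub code):

```python
# "0.010" read as a number and formatted back loses the trailing zero
raw = "0.010"
roundtrip = str(float(raw))
print(roundtrip)  # 0.01 -- no longer matches the 3-digit configured value
assert roundtrip != raw
```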
Might there be a way to do something where we just try to read the files specifying a schema with expected data types, and see if that succeeds or throws errors?
So I've actually included such an option in the read_model_out_file function for CSVs. The only issue is that if, for some reason, the coercion doesn't work on read, then it's the check that the file can be read that will fail instead. Not sure if that is a real problem or not, but worth being aware of it.
Hmmm.
Maybe a temporary solution could be to just go with using that option and the coarser check for now?
And then, I don't know exactly, but: use the read_model_out_file method for reading in again? A bit awkward, but possibly more robust?
Fixed this in #55 for now.
Note that if there is a major problem with the data that does not allow the schema to be applied correctly in CSVs, check_file_read fails with a semi-informative arrow error.
E.g. the following file, with a glaring error in the first element of the horizon column:
#> # A tibble: 10 × 8
#> forecast_date target_end_date horizon target location output_type
#> <date> <date> <chr> <chr> <chr> <chr>
#> 1 2023-05-01 2023-05-08 horizon 1 wk ahead inc fl… US mean
#> 2 2023-05-01 2023-05-15 2 wk ahead inc fl… US mean
#> 3 2023-05-01 2023-05-08 1 wk ahead inc fl… US quantile
#> 4 2023-05-01 2023-05-08 1 wk ahead inc fl… US quantile
#> 5 2023-05-01 2023-05-08 1 wk ahead inc fl… US quantile
#> 6 2023-05-01 2023-05-08 1 wk ahead inc fl… US quantile
#> 7 2023-05-01 2023-05-08 1 wk ahead inc fl… US quantile
#> 8 2023-05-01 2023-05-08 1 wk ahead inc fl… US quantile
#> 9 2023-05-01 2023-05-08 1 wk ahead inc fl… US quantile
#> 10 2023-05-01 2023-05-08 1 wk ahead inc fl… US quantile
#> # ℹ 2 more variables: output_type_id <chr>, value <dbl>
Created on 2023-10-11 with reprex v2.0.2. Reading the file fails with:
Error:
! Invalid: In CSV column #2: CSV conversion error to int32: invalid value 'horizon 1'
Hi Anna,
I reinstalled the latest version, and while the original errors have resolved, I'm receiving a new/different error (screenshot below). It seems like output_type_id is still being read in as a numeric class, so any quantile that doesn't already have three non-zero decimals (e.g., 0.975) is showing up as a mismatch for this field.
Again, I would imagine this is potentially a special case for those submitting quantiles only. I am reasonably certain that I am formatting the file correctly for writing/saving.
Thanks, Lauren
Interesting. That looks like it should indeed be passing, given the valid values in the tasks config. Will investigate. Thanks for reporting!
Validating the attached file yields the following error:
I don't think this error makes sense, since that column could be read in with a character type; a data type issue here is more an issue with how the file was read in than with how the data was encoded.