Apply temporary logic in ModelOutputHandler to force columns named output_type_id and location to string when writing parquet data.
This one isn't worth reviewing commit by commit, since there was a bit of thrashing around re: applying a schema on read (vs. applying schema changes after reading the data into pyarrow).
Reproducing the location schema mismatch error we're attempting to fix:
library(dplyr)
library(hubData)
hub_path_cloud <- s3_bucket("bsweger-flusight-forecast/")
hub_con <- connect_hub(hub_path_cloud, file_format = "parquet", skip_checks=TRUE)
> hub_con %>%
+ filter(output_type == "quantile", location=="US") %>%
+ collect()
Error in `compute.arrow_dplyr_query()`:
! NotImplemented: Function 'equal' has no kernel matching input types (string, int64)
After re-processing the "raw" model_output files using an updated lambda based on this branch, I was able to run the above code successfully.
Resolves #24