hubverse-org / hubverse-transform

Data transform functions for hubverse model-output files
MIT License

Temporary schema patch for location and output_type_id columns #26

Closed bsweger closed 3 months ago

bsweger commented 3 months ago

Resolves #24

Apply temporary logic that makes ModelOutputHandler cast columns named output_type_id and location to string when writing parquet data.

This one isn't worth reviewing commit by commit, since there was some back-and-forth about applying a schema on read vs. applying schema changes after the data has been read into pyarrow.


Reproducing the location schema mismatch error we're attempting to fix:

library(dplyr)
library(hubData)

hub_path_cloud <- s3_bucket("bsweger-flusight-forecast/")
hub_con <- connect_hub(hub_path_cloud, file_format = "parquet", skip_checks=TRUE)

hub_con %>%
  filter(output_type == "quantile", location == "US") %>%
  collect()

Error in `compute.arrow_dplyr_query()`:
! NotImplemented: Function 'equal' has no kernel matching input types (string, int64)

After re-processing the "raw" model_output files using an updated lambda based on this branch, I was able to run the above code successfully:

> hub_path_cloud <- s3_bucket("bsweger-flusight-forecast/")
> hub_con <- connect_hub(hub_path_cloud, file_format = "parquet", skip_checks=TRUE)
> hub_con %>%
+ filter(output_type == "quantile", location=="US") %>%
+ collect()
# A tibble: 100,395 × 9
   reference_date target          horizon location target_end_date output_type output_type_id value model_id        
   <date>         <chr>             <int> <chr>    <date>          <chr>       <chr>          <dbl> <chr>           
 1 2023-10-14     wk inc flu hosp      -1 US       2023-10-07      quantile    0.01            402. CEPH-Rtrend_fluH
 2 2023-10-14     wk inc flu hosp       0 US       2023-10-14      quantile    0.01            236  CEPH-Rtrend_fluH
 3 2023-10-14     wk inc flu hosp       1 US       2023-10-21      quantile    0.01            157. CEPH-Rtrend_fluH