hubverse-org / hubData

Tools for accessing and working with hubverse Hub data
https://hubverse-org.github.io/hubData/

Better messaging when arrow::open_dataset throws error within connect_hub #7

Open annakrystalli opened 1 year ago

annakrystalli commented 1 year ago

Errors thrown by arrow::open_dataset() all produce the same wildly uninformative message, whatever the underlying problem. That covers anything from columns in the provided schema not matching the columns in the data (e.g. trying to open data that still had type and type_id columns after we changed to output_type and output_type_id) to mis-specification of a field's data type (e.g. trying to cast a double or character column to integer).

For example, here I try to cast the character field output_type to the int32 data type.

library(hubUtils)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

model_output_schema <- schema(
  origin_date = date32(),
  target = string(),
  horizon = int32(),
  location = string(),
  output_type = int32(),
  output_type_id = string(),
  value = int32(),
  model_id = string()
)

model_output_dir <- system.file("testhubs/simple/model-output", package = "hubUtils")
mod_out_con <- connect_model_output(model_output_dir, file_format = "csv",
                                    schema = model_output_schema)
#> Error in `arrow::open_dataset()` at hubUtils/R/connect_model_output.R:32:8:
#> ! Invalid: No non-null segments were available for field 'model_id'; couldn't infer type
#> Backtrace:
#>     ▆
#>  1. ├─hubUtils::connect_model_output(...)
#>  2. └─hubUtils:::connect_model_output.default(...) at hubUtils/R/connect_model_output.R:17:4
#>  3.   └─arrow::open_dataset(...) at hubUtils/R/connect_model_output.R:32:8
#>  4.     └─base::tryCatch(...)
#>  5.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  6.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  7.           └─value[[3L]](cond)
#>  8.             └─arrow:::augment_io_error_msg(e, call, format = format)
#>  9.               └─rlang::abort(msg, call = call)

Created on 2023-06-13 with reprex v2.0.2

The error thrown:

#> Error in `arrow::open_dataset()` at hubUtils/R/connect_model_output.R:32:8:
#> ! Invalid: No non-null segments were available for field 'model_id'; couldn't infer type

has sent us on many a wild goose chase without giving any useful pointers to the actual problem, and it will likely be even more confusing to downstream hub users.

Our options are:

  1. Report the poor error handling to arrow and wait for a resolution in the package itself.
  2. Try to capture and analyse the errors and produce our own messages within hubUtils.

I feel we should definitely report the behaviour whatever else we decide. I'm leaning towards 2, on the principle that our functions currently produce really unhelpful error messages, though it may not be that straightforward to implement.
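As a rough illustration of what option 2 could look like, here is a minimal base R sketch that catches an error and re-throws it with some context about the schema. The helper name open_with_context() and the stand-in fake_open_dataset() are hypothetical and not part of hubUtils or arrow; the stand-in only reproduces the message text from the reprex so that the example runs without a hub.

```r
# Hypothetical helper: run `open_fn()` and, if it fails, re-throw the error
# with extra context about the schema that was supplied.
open_with_context <- function(open_fn, schema_fields) {
  tryCatch(
    open_fn(),
    error = function(e) {
      stop(
        paste0(
          "Could not open model output data with the supplied schema.\n",
          "Schema fields: ", paste(schema_fields, collapse = ", "), "\n",
          "Check that these match the columns in the data and that each ",
          "data type is compatible with the values in its column.\n",
          "Original error: ", conditionMessage(e)
        ),
        call. = FALSE
      )
    }
  )
}

# Stand-in for the failing arrow::open_dataset() call (assumption: it only
# reproduces the message text from the reprex above).
fake_open_dataset <- function() {
  stop("Invalid: No non-null segments were available for field 'model_id'; couldn't infer type")
}

# Capture the re-thrown message so it can be inspected or printed.
res <- tryCatch(
  open_with_context(fake_open_dataset, c("origin_date", "output_type")),
  error = function(e) conditionMessage(e)
)
cat(res, "\n")
```

In a real implementation the wrapped call would be the arrow::open_dataset() call inside connect_model_output(), and the original message is kept at the end so nothing arrow reports is lost.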

elray1 commented 1 year ago

I like the idea of 1 as a temporary solution, with a goal to do 2 if still necessary later on, once we've got hubValidations in place
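For the column-name case mentioned in the issue, a check run before the dataset is opened could point at the problem directly. The following is a minimal base R sketch, not something this thread or hubValidations specifies: schema_column_diff() is a hypothetical helper, and the temporary CSV is made-up toy data that mimics a file still using the old type and type_id column names.

```r
# Hypothetical helper: compare the column names in a CSV header with the
# field names expected by a schema, and report what is missing on each side.
schema_column_diff <- function(csv_path, schema_names) {
  # Only the header matters here, so read a single row.
  data_names <- names(utils::read.csv(csv_path, nrows = 1, check.names = FALSE))
  list(
    missing_from_data = setdiff(schema_names, data_names),
    missing_from_schema = setdiff(data_names, schema_names)
  )
}

# Toy file (assumption: illustrative values only) with the old column names.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("origin_date,target,type,type_id,value",
    "2023-01-01,flu,mean,NA,10"),
  tmp
)

col_diff <- schema_column_diff(
  tmp,
  c("origin_date", "target", "output_type", "output_type_id", "value")
)
print(col_diff)
```

With an arrow schema object, the expected names could be taken from names(model_output_schema) rather than written out by hand.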