hubverse-org / hubUtils

Utility functions for Infectious Disease Modeling Hubs
https://hubverse-org.github.io/hubUtils/

Improve output message of connect_hub() #124

Closed LucieContamin closed 8 months ago

LucieContamin commented 11 months ago

Currently, if we load a hub without any issues, we obtain:

 hubUtils::connect_hub("example-simple-forecast-hub/")
#> 
#> ── <hub_connection/FileSystemDataset> ──
#> 
#> • hub_name: "Simple Forecast Hub"
#> • hub_path: 'example-simple-forecast-hub/'
#> • file_format: "csv(9)"
#> • file_system: "LocalFileSystem"
#> • model_output_dir:
#>   "example-simple-forecast-hub/model-output"
#> • config_admin: 'hub-config/admin.json'
#> • config_tasks: 'hub-config/tasks.json'
#> 
#> ── Connection schema
#> hub_connection with 9 csv files
#> origin_date: date32[day]
#> horizon: int32
#> location: string
#> target: string
#> output_type: string
#> output_type_id: double
#> value: int32
#> model_id: string

Created on 2023-11-10 with reprex v2.0.2

However, if we update one of the CSV files incorrectly (for example, without updating tasks.json accordingly, or with a column in an unexpected format, etc.), we obtain:

hubUtils::connect_hub("example-simple-forecast-hub/")
#> 
#> ── <hub_connection/FileSystemDataset> ──
#> 
#> • hub_name: "Simple Forecast Hub"
#> • hub_path: 'example-simple-forecast-hub/'
#> • file_format: "csv(8)"
#> • file_system: "LocalFileSystem"
#> • model_output_dir:
#>   "example-simple-forecast-hub/model-output"
#> • config_admin: 'hub-config/admin.json'
#> • config_tasks: 'hub-config/tasks.json'
#> 
#> ── Connection schema
#> hub_connection with 8 csv files
#> origin_date: date32[day]
#> horizon: int32
#> location: string
#> target: string
#> output_type: string
#> output_type_id: double
#> value: int32
#> model_id: string

Created on 2023-11-10 with reprex v2.0.2

As expected, the number of CSV files in the second example is 8 instead of 9, because the "incorrect" file is not included. However, if you load the data directly by doing:

 hubUtils::connect_hub(hub_path) %>%
   dplyr::collect()

or if we don't look at the hub connection object in detail, it's easy to miss that one file was not included.

So I wonder if it would be helpful to print a message or warning when something like this happens?
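
A minimal sketch of the kind of check being asked about, in base R. `report_unopened_files()` is a hypothetical helper, not part of hubUtils. It assumes file base names are unique within the hub, and the commented usage assumes the connection is a single arrow FileSystemDataset exposing `$files`.

```r
# Hypothetical helper (not part of hubUtils): compare the files found on disk
# with the files a connection actually opened, and warn about any difference.
report_unopened_files <- function(dir_files, opened_files) {
  # Match on base names so relative and absolute paths still line up.
  # Assumption: base names are unique across the model output directory.
  unopened <- dir_files[!basename(dir_files) %in% basename(opened_files)]
  if (length(unopened) > 0) {
    warning(
      length(unopened), " of ", length(dir_files),
      " file(s) not opened: ", paste(unopened, collapse = ", "),
      call. = FALSE
    )
  }
  # Return the unopened paths invisibly so callers can inspect them.
  invisible(unopened)
}

# Possible usage against a real hub (needs hubUtils and a hub on disk):
# hub_path <- "example-simple-forecast-hub/"
# hub_con <- hubUtils::connect_hub(hub_path)
# on_disk <- list.files(file.path(hub_path, "model-output"),
#                       pattern = "\\.csv$", recursive = TRUE,
#                       full.names = TRUE)
# report_unopened_files(on_disk, hub_con$files)
```

Returning the unopened paths, rather than only warning, would let a caller decide whether to stop or carry on with the files that did load.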

annakrystalli commented 9 months ago

Hey @LucieContamin! This was a great idea, and I've just pushed the feature to this PR: https://github.com/Infectious-Disease-Modeling-Hubs/hubUtils/pull/131

Now, when printing a hub_connection object, it reports, for each file format, the number of files opened out of the total available in the directory:

hub_path <- system.file("testhubs/simple", package = "hubUtils")
hubUtils::connect_hub(hub_path) 
#> 
#> ── <hub_connection/UnionDataset> ──
#> 
#> • hub_name: "Simple Forecast Hub"
#> • hub_path:
#>   '/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/hubUtils/testhubs/simple'
#> • file_format: "csv(3/3)" and "parquet(1/1)"
#> • file_system: "LocalFileSystem"
#> • model_output_dir:
#>   "/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/hubUtils/testhubs/simple/model-output"
#> • config_admin: 'hub-config/admin.json'
#> • config_tasks: 'hub-config/tasks.json'
#> 
#> ── Connection schema
#> hub_connection
#> origin_date: date32[day]
#> target: string
#> horizon: int32
#> location: string
#> output_type: string
#> output_type_id: double
#> value: int32
#> model_id: string
#> age_group: string

Created on 2024-01-10 with reprex v2.0.2

It will also throw a warning if there are unopened files and identify the files with problems. One thing I could use your input on for testing: is there a simple situation that has caused this problem for you that I could recreate as a test case? That is, what sort of problems did you find led to individual files going missing?