hubverse-org / hubValidations

Testing framework for hubverse hub validations
https://hubverse-org.github.io/hubValidations/

`check_tbl_values_required()` fails to detect missing required task IDs for samples #123

Closed zkamvar closed 2 days ago

zkamvar commented 2 days ago

An issue was found in https://github.com/reichlab/variant-nowcast-hub/issues/83 (tested in https://github.com/reichlab/variant-nowcast-hub/pull/90): check_tbl_values_required() does not invalidate a submission when required task ID values are missing, in a hub that accepts sample data.

tmp <- tempfile()
tmpdir <- tempfile()
dir.create(tmpdir)
download.file("https://github.com/IsaacMacarthur/variant-nowcast-hub/archive/b80f0559c7fd019b569f39437cf684aca57639a6.zip", tmp)
unzip(tmp, exdir = tmpdir)

library("hubValidations")
library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:hubValidations':
#> 
#>     combine
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
hub_path <- list.files(tmpdir, full.names = TRUE)
file_path <- "UMass-HMLR/2024-10-02-UMass-HMLR.parquet"
round_id <- "2024-10-02"
tbl <- read_model_out_file(hub_path = hub_path, file_path = file_path) %>%
  mutate_at(vars(ends_with("date")), as.character)

The hub config requires specific clades in the task IDs:

tasks <- hubUtils::read_config_file(file.path(hub_path, "hub-config", "tasks.json"))
purrr::map(tasks$rounds, list("model_tasks", 1, "task_ids", "clade"))
#> [[1]]
#> [[1]]$required
#> [1] "24A"         "24B"         "24C"         "other"       "recombinant"
#> 
#> [[1]]$optional
#> NULL
#> 
#> 
#> [[2]]
#> [[2]]$required
#> [1] "24A"         "24B"         "24C"         "other"       "recombinant"
#> 
#> [[2]]$optional
#> NULL
#> 
#> 
#> [[3]]
#> [[3]]$required
#> [1] "24A"         "24B"         "24C"         "other"       "recombinant"
#> 
#> [[3]]$optional
#> NULL
#> 
#> 
#> [[4]]
#> [[4]]$required
#> [1] "24A"         "24B"         "24C"         "24E"         "other"      
#> [6] "recombinant"
#> 
#> [[4]]$optional
#> NULL

Several of these clades are missing from the submission file:

unique(tbl$clade)
#> [1] "recombinant" "other"
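To make the gap explicit, the required values can be compared with what the file contains. This is only a minimal base R sketch: the two vectors are stand-ins copied from the printed output above (assuming the first round's required clades apply to this submission), not values read from the hub.

```r
# Stand-in values copied from the output above (assumption: these are the
# required clades for the round being validated).
required_clades <- c("24A", "24B", "24C", "other", "recombinant")
submitted_clades <- c("recombinant", "other")

# Required clades that never appear in the submission.
missing_clades <- setdiff(required_clades, submitted_clades)
print(missing_clades)
```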

The check succeeds regardless:

check_tbl_values_required(tbl, round_id, hub_path = hub_path, file_path = file_path)
#> <message/check_success>
#> Message:
#> Required task ID/output type/output type ID combinations all present.

Created on 2024-10-02 with reprex v2.1.1

Analysis

I might have been chasing a red herring, but when I dug into this, I found that expand_model_out_grid() does not create a grid when there are no output type IDs to expand.

This function is expected to take a list of required task IDs and a list of required output type IDs and return a data frame that contains the expansion of both.

https://github.com/hubverse-org/hubValidations/blob/85d58251c7cbcbb4064a72f8126ad8846f351dbd/R/expand_model_out_grid.R#L245-L259
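As a rough illustration of the expected expansion, here is a base R sketch. The task ID values, the output type, and the output type IDs are all hypothetical, and this is not the package's implementation, just the same idea written with lapply() and expand.grid().

```r
# Hypothetical inputs for illustration; not taken from the hub's tasks.json.
task_id_values <- list(
  clade = c("24A", "24B"),
  location = c("MA", "NY")
)
output_type_values <- list(sample = c("s1", "s2"))

# Cross every task ID value with every output type ID, one output type at a
# time, then stack the per-type grids into a single data frame.
grids <- lapply(names(output_type_values), function(type) {
  expand.grid(
    c(task_id_values, list(
      output_type = type,
      output_type_id = output_type_values[[type]]
    )),
    stringsAsFactors = FALSE
  )
})
grid <- do.call(rbind, grids)

# One row per combination: 2 clades x 2 locations x 2 output type IDs.
nrow(grid)
names(grid)
```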

The problem arises when there are no required output type IDs (e.g. in the default case, where we are not expanding sample output type IDs). If I add a guard before this that returns just the expanded task ID table, the submission is correctly invalidated, but a few tests subsequently fail due to bad joins.

# Function that expands modeling task level lists of task IDs and output type
# values into a grid and combines them into a single tibble.
expand_output_type_grid <- function(task_id_values,
                                    output_type_values) {
+ if (length(output_type_values) == 0) {
+   return(expand.grid(purrr::compact(task_id_values), stringsAsFactors = FALSE))
+ }
  purrr::imap(
    output_type_values,
    ~ c(task_id_values, list(
      output_type = .y,
      output_type_id = .x
    )) %>%
      purrr::compact() %>%
      expand.grid(stringsAsFactors = FALSE)
  ) %>%
    purrr::list_rbind()
}
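
The behaviour described above can be reproduced in isolation. The sketch below uses hypothetical task IDs and assumes purrr 1.0.0 or later (for list_rbind()). It mirrors the shape of the function body quoted above rather than calling hubValidations itself.

```r
library(purrr)

# Hypothetical task IDs; output_type_values is empty, as in the case with no
# required output type IDs.
task_id_values <- list(
  clade = c("24A", "24B", "24C"),
  location = c("MA", "NY")
)
output_type_values <- list()

# Same shape as the quoted function body, without the guard: imap() has
# nothing to iterate over, so there is nothing to row-bind.
unguarded <- list_rbind(
  imap(
    output_type_values,
    ~ expand.grid(
      compact(c(task_id_values, list(output_type = .y, output_type_id = .x))),
      stringsAsFactors = FALSE
    )
  )
)

# With the proposed guard: only the task IDs are expanded.
guarded <- expand.grid(compact(task_id_values), stringsAsFactors = FALSE)

nrow(unguarded) # no required rows are generated at all
nrow(guarded)   # one row per clade x location combination
names(guarded)  # task ID columns only
```

Note that the guarded grid has no output_type or output_type_id columns. That difference in shape may be related to the join failures mentioned above, though that is an assumption and not something confirmed here.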