ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0

Target always outdated #1298

Closed jennysjaarda closed 4 years ago

jennysjaarda commented 4 years ago

Description

I have a fairly complex plan, and I noticed that one target is always outdated, which in turn marks every downstream target as outdated. I have no idea why, and I was wondering if you could help me isolate the problem. I have narrowed it down to the function below. I have tried running deps_profile on the target, but I don't really know what to look for. Any suggestions would be greatly appreciated!


munge_pheno_follow <-  function(pheno_baseline, test_drugs, i) {
  out <- list()
  #for(i in 1:dim(test_drugs)[1])
  #{
    drug_list <- unlist(test_drugs %>% dplyr::select(drugs) %>% dplyr::slice(i))
    drug_class <- unlist(test_drugs %>% dplyr::select(class) %>% dplyr::slice(i))
    col_match <- paste(paste0(drug_list,"_ever_drug"), collapse = "|")
    followup_data <- pheno_baseline %>% rowwise() %>%
      dplyr::do({
            result = as_tibble(.) # result <- pheno_baseline %>% slice(49)
            x =  classify_drugs(result,drug_list,low_inducers, high_inducers)
            result$high_inducer=x$case
            result$high_inducer_drug_num=x$drug_num
            result$high_inducer_drug_name=x$drug_name
            result$bmi_change=x$bmi_diff
            result$bmi_change_6mo=x$bmi_diff_6mo
            result$follow_up_time=x$duration
            result$follow_up_time_6mo=x$duration_6mo
            result$age_started=x$age_started
            result$bmi_start=x$bmi_start
            result$bmi_slope=x$bmi_slope
            result$bmi_slope_6mo=x$bmi_slope_6mo
            result$bmi_slope_weight=x$bmi_slope_weight
            result$bmi_slope_weight_6mo=x$bmi_slope_weight_6mo

            result
        }) %>% ungroup() %>%
      mutate(ever_drug_match = rowSums(dplyr::select(., matches(col_match)) == 1) > 0) %>%
      filter(!(ever_drug_match & high_inducer==0)) %>% ##filter out individuals who have taken this drug but it wasn't followed
      mutate(follow_up_time_sq = follow_up_time^2) %>%
      mutate(follow_up_time_6mo_sq = follow_up_time_6mo^2) %>%
      mutate(age_sq=age_started^2)

    out[[drug_class]] <- followup_data
  #}
  return(out)
}

The plan looks like this:


analysis_prep <- drake_plan(
  # prepare phenotype files for analysis in GWAS/GRS etc. ------------

  pheno_raw = readr::read_delim(file_in(!!pheno_file), col_types = cols(.default = col_character()), delim = ",") %>% type_convert(),
  caffeine_raw = target(read_excel(file_in(!!caffeine_file), sheet=1) %>% type_convert(),
    hpc = FALSE),
  # found a mistake that the age of WIFRYNSK was wrong at one instance.

  caffeine_munge = munge_caffeine(caffeine_raw),
  bgen_sample_file = target({
    readr::read_delim(file_in(!!paste0("analysis/QC/15_final_processing/FULL/", study_name, ".FULL.sample")),
      col_types = cols(.default = col_character()), delim = " ") %>% type_convert()
    }),
  bgen_nosex_out = write.table(bgen_sample_file[,1:3],file_out(!!paste0("analysis/QC/15_final_processing/FULL/", study_name, ".FULL_nosex.sample")),row.names = F, quote = F, col.names = T),
  pc_raw = read_pcs(file_in(!!pc_dir), !!study_name, !!eths) %>% as_tibble(),

  pheno_munge = munge_pheno(pheno_raw, !!baseline_vars, !!leeway_time, caffeine_munge, !!follow_up_limit), # pheno_munge %>% count(GEN) %>% filter(n!=1) ## NO DUPLICATES !
  #pheno_munge = munge_pheno(pheno_raw, baseline_vars, leeway_time, caffeine_munge, follow_up_limit)
  pheno_merge = merge_pheno_caffeine(pheno_raw, caffeine_munge, !!anonymization_error), # in the end, we don't use this dataset, but if you want to merge by caffeine with the appropriate date use this data.

  pheno_baseline = inner_join(pc_raw %>% mutate_at("GPCR", as.character), pheno_munge %>% mutate_at("GEN", as.character), by = c("GPCR" = "GEN")) %>%
    replace_na(list(sex='NONE')),
  pheno_eths_out = write.table(pheno_baseline %>% tidyr::separate(FID, c("COUNT", "GPCR"), "_") %>%
    dplyr::select(COUNT,GPCR,eth ), file_out(!!paste0("data/processed/phenotype_data/", study_name, "_inferred_eths.txt")), row.names = F, quote = F, col.names = T),
  test_drugs_num = tibble(i = 1:dim(!!test_drugs)[1]),
  pheno_followup = target(munge_pheno_follow(pheno_baseline, !!test_drugs, test_drugs_num$i),
    dynamic = map(test_drugs_num))

## more targets that I have not included here; all are outdated as a result of pheno_followup being outdated
)

jennysjaarda commented 4 years ago

I think I solved it by a stroke of luck. The problematic function calls another function, classify_drugs(result, drug_list, low_inducers, high_inducers), and the variables low_inducers and high_inducers were neither arguments of munge_pheno_follow nor defined inside it, so they were being picked up from the global environment.

Changing the function to take these two variables as inputs seems to have fixed it: the target is no longer always outdated. I also updated the drake_plan to match: munge_pheno_follow(pheno_baseline, !!test_drugs, test_drugs_num$i, !!low_inducers, !!high_inducers)
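
For anyone landing here with the same symptom, the shape of the change can be sketched roughly as follows. This is a stripped-down illustration, not the real code: the classify_drugs() body, the toy test_drugs table and the drug names are all invented stand-ins, and only the pattern of passing low_inducers and high_inducers as explicit arguments comes from the fix described above.

```r
# Hypothetical stand-in for classify_drugs(); the real function is not shown
# in this thread. It only flags whether any listed drug is a high inducer.
classify_drugs <- function(result, drug_list, low_inducers, high_inducers) {
  list(case = as.integer(any(drug_list %in% high_inducers)))
}

# low_inducers and high_inducers are now explicit arguments, so the function
# no longer reads anything from the global environment.
munge_pheno_follow <- function(pheno_baseline, test_drugs, i,
                               low_inducers, high_inducers) {
  drug_list <- unlist(test_drugs$drugs[i])
  drug_class <- test_drugs$class[i]
  x <- classify_drugs(pheno_baseline, drug_list, low_inducers, high_inducers)
  pheno_baseline$high_inducer <- x$case
  out <- list()
  out[[drug_class]] <- pheno_baseline
  out
}

# Toy inputs, invented purely for illustration.
test_drugs <- data.frame(class = c("classA", "classB"), stringsAsFactors = FALSE)
test_drugs$drugs <- list(c("drug1", "drug2"), "drug3")
pheno_baseline <- data.frame(GEN = c("a", "b"), stringsAsFactors = FALSE)

res <- munge_pheno_follow(pheno_baseline, test_drugs, 1,
                          low_inducers = "drug3", high_inducers = "drug1")
names(res)
```

In the plan, the two extra values are then supplied in the command itself, as in the updated call quoted above.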

I hope that is consistent with how drake behaves! I know it isn't good practice to use globally defined variables within a function, I just forgot to include them as inputs in this case.
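
In case it helps someone else spot a forgotten input like this, here is a small sketch of one way to list the globals a function relies on. It uses codetools::findGlobals() rather than anything drake-specific (drake's deps_code() reports similar information for a function or command), and uses_globals() and classify() are made-up names that only mimic the pattern in the original munge_pheno_follow.

```r
library(codetools)  # a recommended package bundled with standard R installs

# Toy function that, like the original munge_pheno_follow(), reads two
# variables it never receives as arguments. classify() is hypothetical and is
# never called here; findGlobals() only inspects the code.
uses_globals <- function(result, drug_list) {
  classify(result, drug_list, low_inducers, high_inducers)
}

# merge = FALSE separates global functions from global variables.
globals <- findGlobals(uses_globals, merge = FALSE)

# Names listed here are looked up outside the function when it runs.
globals$variables
```

Any data object that shows up under variables is a candidate for becoming an explicit argument.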