ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0
1.34k stars 128 forks source link

Replicating target names in different plans #1361

Closed jennysjaarda closed 3 years ago

jennysjaarda commented 3 years ago

Prework

Description

Hello, I've seen you mention before that target names within a project should be unique (e.g. here). However, I have multiple plans within one project because my plans have different computing requirements (I am using the clustermq backend rather than future as clustermq seems to be more reliable and much faster). If I replicate two targets exactly, I don't seem to have any problems. See an example from the book below.

Reproducible example

gapminder_continents <- function() {
  gapminder %>%
    mutate(gdpPercap = scale(gdpPercap)) %>%
    split(f = .$continent)
}

# Fit a model to a continent.
fit_model <- function(continent_data) {
  data <- continent_data[[1]]
  data %>%
    lm(formula = gdpPercap ~ year) %>%
    tidy() %>%
    mutate(continent = data$continent[1]) %>%
    select(continent, term, statistic, p.value)
}

plan1 <- drake_plan(
  continents = gapminder_continents(),
  model = target(fit_model(continents), dynamic = map(continents))
)

plan2 <- drake_plan(
  continents = gapminder_continents(),

)

make(plan1)
make(plan2)
outdated(plan1)

Desired result

The output is as expected. Is the above bad practice? Do you see any issue with proceeding as I have above?

Thanks a lot in advance!

jennysjaarda commented 3 years ago

Just a quick comment. I think I used to have an issue with duplicate targets as I have a comment in my script:

# this should be identical to targetXX above, but needs to be a different name

Perhaps an update allowed for duplicate targets in one project?

wlandau commented 3 years ago

I would recommend this:

whole_plan <- drake_plan(
  continents = gapminder_continents(),
  model = target(fit_model(continents), dynamic = map(continents))
)

make(whole_plan, targets = "continents")
make(whole_plan)

This avoids potential unintended consequences of omitting targets from the plan itself.

By the way, targets::tar_make() and targets::tar_make_clustermq() support an analogous names argument. Unlike drake, targets supports tidyselect syntax, e.g. tar_make(names = starts_with("abc")).

jennysjaarda commented 3 years ago

Thanks for the suggestion. It's not entirely clear to me. So the targets argument specifies explicitly the targets that you want to run? What if the plan2 as defined as above has many more targets, such that it wouldn't be very practical to specify explicitly the targets? For example:

plan1 <- drake_plan(
  data_geno = load_geno(),
  data_pheno = load_pheno(),
  munge_geno = geno_munge(data_geno), 
  munge_pheno = pheno_munge(data_pheno), 
  analysis1 = simple_analysis(munge_pheno, munge_geno)

)

make(plan1, 
  parallelism = "clustermq", template = list(cpus = 1))

plan2 <- drake_plan(
  data_geno = load_geno(),
  data_pheno = load_pheno(),
  munge_geno = geno_munge(data_geno), 
  munge_pheno = pheno_munge(data_pheno), 
  analysis2 = computationally_intense_analysis(munge_pheno, munge_geno)

)

make(plan2, 
  parallelism = "clustermq", template = list(cpus = 8))

Would the above work? Sorry if I'm missing something obvious.

jennysjaarda commented 3 years ago

By the way, I just noticed the new {targets} package! (I am just returning from being away on parental leave for several months which is why I missed it.) Looking forward to trying it out. My current project is quite embedded with the {drake} infrastructure so I guess it's best to keep it with drake. If I migrated I suppose all targets would be outdated and need to be re-run?

wlandau commented 3 years ago

So the targets argument specifies explicitly the targets that you want to run?

Yes, plus anything upstream that those targets depend on. That is the issue. Upstream targets need to be processed before downstream ones can start. If some targets in plan2 have dependencies in plan1, then make(plan2) will not work as expected. That is why I do not encourage this pattern. The following is safer and achieves the same goals.

plan1 <- drake_plan(...)
plan2 <- drake_plan(...)
whole_plan <- bind_plans(plan1, plan2)
make(whole_plan, targets = plan1$target, parallelism = "clustermq", template = list(cpus = 1))
make(whole_plan, targets = plan2$target, parallelism = "clustermq", template = list(cpus = 8))

My current project is quite embedded with the {drake} infrastructure so I guess it's best to keep it with drake.

Yes, old projects can keep using drake, and new projects should use targets. As you probably saw, drake is not deprecated, just superseded. I will keep it running indefinitely so old projects do not break, but I do not plan to add significant enhancements.