Replicating target names in different plans

jennysjaarda commented 3 years ago

Prework

[X] Read and agree to the code of conduct and contributing guidelines.
[X] If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
[X] Post a minimal reproducible example so the maintainer can troubleshoot the problems you identify. A reproducible example is:
- [X] Runnable: post enough R code and data so any onlooker can create the error on their own computer.
- [X] Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
- [X] Readable: format your code according to the tidyverse style guide.

Description

Hello, I've seen you mention before that target names within a project should be unique (e.g. here). However, I have multiple plans within one project because my plans have different computing requirements (I am using the clustermq backend rather than future as clustermq seems to be more reliable and much faster). If I replicate two targets exactly, I don't seem to have any problems. See an example from the book below.

Reproducible example

[X] Post a minimal reproducible example so the maintainer can troubleshoot the problems you identify. A reproducible example is:
- [X] Runnable: post enough R code and data so any onlooker can create the error on their own computer.
- [X] Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
- [X] Readable: format your code according to the tidyverse style guide.

gapminder_continents <- function() {
  gapminder %>%
    mutate(gdpPercap = scale(gdpPercap)) %>%
    split(f = .$continent)
}

# Fit a model to a continent.
fit_model <- function(continent_data) {
  data <- continent_data[[1]]
  data %>%
    lm(formula = gdpPercap ~ year) %>%
    tidy() %>%
    mutate(continent = data$continent[1]) %>%
    select(continent, term, statistic, p.value)
}

plan1 <- drake_plan(
  continents = gapminder_continents(),
  model = target(fit_model(continents), dynamic = map(continents))
)

plan2 <- drake_plan(
  continents = gapminder_continents(),

)

make(plan1)
make(plan2)
outdated(plan1)

Desired result

The output is as expected. Is the above bad practice? Do you see any issue with proceeding as I have above?

Thanks a lot in advance!

jennysjaarda commented 3 years ago

Just a quick comment. I think I used to have an issue with duplicate targets as I have a comment in my script:

# this should be identical to targetXX above, but needs to be a different name

Perhaps an update allowed for duplicate targets in one project?

wlandau commented 3 years ago

I would recommend this:

whole_plan <- drake_plan(
  continents = gapminder_continents(),
  model = target(fit_model(continents), dynamic = map(continents))
)

make(whole_plan, targets = "continents")
make(whole_plan)

This avoids potential unintended consequences of omitting targets from the plan itself.

By the way, targets::tar_make() and targets::tar_make_clustermq() support an analogous names argument. Unlike drake, targets supports tidyselect syntax, e.g. tar_make(names = starts_with("abc")).

jennysjaarda commented 3 years ago

Thanks for the suggestion. It's not entirely clear to me. So the targets argument specifies explicitly the targets that you want to run? What if the plan2 as defined as above has many more targets, such that it wouldn't be very practical to specify explicitly the targets? For example:

plan1 <- drake_plan(
  data_geno = load_geno(),
  data_pheno = load_pheno(),
  munge_geno = geno_munge(data_geno), 
  munge_pheno = pheno_munge(data_pheno), 
  analysis1 = simple_analysis(munge_pheno, munge_geno)

)

make(plan1, 
  parallelism = "clustermq", template = list(cpus = 1))

plan2 <- drake_plan(
  data_geno = load_geno(),
  data_pheno = load_pheno(),
  munge_geno = geno_munge(data_geno), 
  munge_pheno = pheno_munge(data_pheno), 
  analysis2 = computationally_intense_analysis(munge_pheno, munge_geno)

)

make(plan2, 
  parallelism = "clustermq", template = list(cpus = 8))

Would the above work? Sorry if I'm missing something obvious.

jennysjaarda commented 3 years ago

By the way, I just noticed the new {targets} package! (I am just returning from being away on parental leave for several months which is why I missed it.) Looking forward to trying it out. My current project is quite embedded with the {drake} infrastructure so I guess it's best to keep it with drake. If I migrated I suppose all targets would be outdated and need to be re-run?

wlandau commented 3 years ago

So the targets argument specifies explicitly the targets that you want to run?

Yes, plus anything upstream that those targets depend on. That is the issue. Upstream targets need to be processed before downstream ones can start. If some targets in plan2 have dependencies in plan1, then make(plan2) will not work as expected. That is why I do not encourage this pattern. The following is safer and achieves the same goals.

plan1 <- drake_plan(...)
plan2 <- drake_plan(...)
whole_plan <- bind_plans(plan1, plan2)
make(whole_plan, targets = plan1$target, parallelism = "clustermq", template = list(cpus = 1))
make(whole_plan, targets = plan2$target, parallelism = "clustermq", template = list(cpus = 8))

My current project is quite embedded with the {drake} infrastructure so I guess it's best to keep it with drake.

Yes, old projects can keep using drake, and new projects should use targets. As you probably saw, drake is not deprecated, just superseded. I will keep it running indefinitely so old projects do not break, but I do not plan to add significant enhancements.

ropensci / drake