ropensci / software-review

rOpenSci Software Peer Review.
291 stars 104 forks source link

stantargets: reproducible Stan pipelines at scale #430

Closed wlandau closed 3 years ago

wlandau commented 3 years ago

Submitting Author: Will Landau (@wlandau)
Repository: https://github.com/wlandau/stantargets Version submitted: 0.0.0.9000 Editor: @melvidoni Reviewers: @sakrejda @mattwarkentin

Due date for @sakrejda: 2021-03-31 Due date for @mattwarkentin: 2021-03-31

Archive: TBD
Version accepted: TBD


Package: stantargets
Title: Targets for Stan Workflows
Description: Bayesian data analysis usually incurs long runtimes
  and cumbersome custom code. A specialized pipeline toolkit for
  Bayesians, the 'stantargets' R package leverages
  'targets' and 'cmdstanr' to ease these burdens.
  'stantargets' makes it super easy to set up useful scalable
  Stan pipelines that automatically parallelize the computation
  and skip expensive steps when the results are already up to date.
  Minimal custom code is required, and there is no need to manually
  configure branching, so usage is much easier than 'targets' alone.
  'stantargets' can access all of 'cmdstanr''s major algorithms
  (MCMC, variational Bayes, and optimization) and it supports
  both single-fit workflows and multi-rep simulation studies.
  For the statistical methodology, please refer to 'Stan' documentation
  (Stan Development Team 2020) <https://mc-stan.org/>.
Version: 0.0.0.9000
License: MIT + file LICENSE
URL: https://wlandau.github.io/stantargets/, https://github.com/wlandau/stantargets
BugReports: https://github.com/wlandau/stantargets/issues
Authors@R: c(
  person(
    given = c("William", "Michael"),
    family = "Landau",
    role = c("aut", "cre"),
    email = "will.landau@gmail.com",
    comment = c(ORCID = "0000-0003-1878-3253")
  ),
  person(
    family = "Eli Lilly and Company",
    role = "cph"
  ))
Depends:
  R (>= 3.5.0)
Imports:
  cmdstanr (>= 0.2.0),
  digest (>= 0.6.21),
  fst (>= 0.9.4),
  posterior (>= 0.1.2),
  purrr (>= 0.3.4),
  qs (>= 0.14.1),
  rlang (>= 0.4.8),
  stats,
  targets (>= 0.0.1),
  tarchetypes (>= 0.0.1),
  tibble (>= 3.0.4),
  tools
Suggests:
  dplyr (>= 1.0.2),
  fs (>= 1.5.0),
  knitr (>= 1.28),
  R.utils (>= 2.10.1),
  rmarkdown (>= 2.1),
  testthat (>= 3.0.0),
  visNetwork (>= 2.0.9),
  withr (>= 2.1.2)
Remotes:
  stan-dev/cmdstanr,
  stan-dev/posterior
SystemRequirements: CmdStan >= 2.25.0
Encoding: UTF-8
Language: en-US
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
VignetteBuilder: knitr
Config/testthat/edition: 3

Scope

stantargets is very similar to jagstargets (#425). stantargets leverages the existing workflow automation capabilities of targets to orchestrate computation and skip up-to-date tasks in Bayesian data analysis pipelines. stantargets reduces the burden of user-side custom code that targets would otherwise require, which helps free statisticians to focus more on the models and less on the software engineering.

stantargets is for Bayesian statisticians who develop and run Stan models. Example workflows range from individual analyses of clinical data to large-scale simulation-based calibration studies for validation.

targets already provides the same kind of workflow automation, but it requires more custom code to set up a workflow. stantargets uses specialized domain knowledge to make this process easier. Packages rstan and cmdstanr interface with Stan but do not provide the same kind of workflow automation. In light of the recent preprint by Gelman et al. (2020), I believe the Stan Development Team would be very interested in this kind of workflow automation.

N/A

N/A

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

If JOSS is still an option, I would like to publish there. I have prepared a manuscript at https://github.com/wlandau/stantargets/blob/main/inst/paper.md.

MEE Options - [ ] The package is novel and will be of interest to the broad readership of the journal. - [ ] The manuscript describing the package is no longer than 3000 words. - [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)) - (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*) - (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*) - (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

melvidoni commented 3 years ago

Editor checks:


Editor comments

Thank you for submitting it! I'll be your handling editor this time. Goodpractice is very happy about it:

β™₯ Aha! Fantastic package! Keep up the prime work!

There are no spelling checks, and covr is also happy. Good job!

I will be searching for reviewers then.


Reviewers: @sakrejda @mattwarkentin Due date: Wednesday, March 31st

wlandau commented 3 years ago

Thanks, @melvidoni! Looking forward to the review process. Please let me know if you would like me to suggest potential reviewers.

melvidoni commented 3 years ago

Quick update. I have contacted about 15 people, and received no acceptance to review. Please bear with me while I find someone. And yes, I have your master list of suggestions.

wlandau commented 3 years ago

Thanks for the update, @melvidoni. I do understand this may take some time.

And yes, I have your master list of suggestions.

Do you mean the folks I mentioned at https://github.com/ropensci/software-review/issues/401#issuecomment-711036834 for targets? I actually think about stantargets a bit differently because of its focus on Bayesian Statistics. (Likewise for jagstargets in #425.) For my own learning, it would be amazing to include someone from the Stan Development Team. If not, then I think practicing Bayesian statisticians, particularly Stan users, would have a lot to add.

melvidoni commented 3 years ago

Hello @wlandau. Yes, that list. We curated it with some editors, and also have our own list of people. However, given the current world circumstances, I am afraid that finding reviewers may take longer than expected.

wlandau commented 3 years ago

Yes, that list.

Oh good, glad we are on the same page. Sorry, I may have misremembered the earlier parts of this thread.

However, given the current world circumstances, I am afraid that finding reviewers may take longer than expected.

I completely understand.

For what it's worth, dependency packages cmdstanr and posterior are not yet on CRAN, so this review process is extremely unlikely to bottleneck any of my plans. And unlike targets, I am not developing any packages that depend on stantargets.

mdscheuerell commented 3 years ago

Perhaps Krzysztof Sakrejda would be available?

melvidoni commented 3 years ago

Perhaps Krzysztof Sakrejda would be available?

Hello @sakrejda are you interested?

sakrejda commented 3 years ago

Hi Melina, I am interested and the three-week time frame is reasonable for me. I'm not the right person to say much about whether the package aligns well with how targets itself is commonly used but should be able to provide feedback otherwise. If you haven't asked @tjmahr yet he might be interested in reviewing.

melvidoni commented 3 years ago

Hello @sakrejda thank you kindly! @mattwarkentin will also be reviewing this package.

Please remember to follow the guidelines for the review: https://devguide.ropensci.org/reviewerguide.html

The review deadline will be Wednesday 31st.

tjmahr commented 3 years ago

If you haven't asked @tjmahr yet he might be interested in reviewing.

Thanks for the mention, but I just did a targets/tarchetypes review so a fresher set of eyes would be better. Please do consider me for brmstargets πŸ˜‰ though.

sakrejda commented 3 years ago

@wlandau to make my life easier reviewing, is there a particular overall approach to the API, which functions you are exposing to users and which ones are considered "internal"? I don't immediately recognize the structure.

To be more specific, in the stan compilation functions there are quite a few layers of functions before you actually get down to calling the stan compilation function and it would be helpful to have an idea of why each layer is there. Just to be clear, I don't disagree and I sometimes take a similar approach, but I'm not sure what the goals are there. I assume this is something that happens consistently across the package. Can you reply here pointing out the reasons for the different layers?

wlandau commented 3 years ago

Most stantargets interface functions create target objects. Each target object has a name, some settings, and an R command that will run later (during tar_make()). Putting these things together involves some layers of setup and metaprogramming. These are the layers for standalone compilation of models:

  1. tar_stan_compile(): diffuse/quote any user-supplied language objects and move on to the next layer. In other words, users should be able to write tar_stan_compile(target_name, "model_file.stan") without risk of immediately evaluating target_name as a variable. This may seem trivial in this particular example, but for other target factories like tar_stan_mcmc(), we also need capture the user-defined expression that generates the Stan data (and support tidy evaluation, which is best done in the first call frame).
  2. tar_stan_compile_raw(): Gather all the settings and expressions from (1) into a target object. This layer takes a 10000-foot view of the whole target and cannot afford to go down any rabbit holes. For the sake of keeping a clean codebase and breaking down the problem into manageable pieces, it delegates the more specialized tasks to subsequent deeper layers. In targets and packages I build on top of targets, the "_raw" suffix indicates this intent.
  3. tar_stan_compile_command(): In stantargets, not all the R expressions come directly from user input. Target factories in stantargets automatically supply many of these expressions to reduce the amount of custom code the user needs to write. The metaprogramming to create an expression from non-language input is complicated enough to need its own layer.
  4. tar_stan_compile_run(): actually compile a Stan model when called. The "*_run()" functions significantly reduce burden of metaprogramming in layers like (3). Rather than bootstrapping a large elaborate code chunk in (3), (3) can simply create a function call to invoke (4). (4) usually has no metaprogramming, so it is a much better place to manage the complicated details of how to properly invoke cmdstanr.

(1) is the only layer the user needs to see. (4) still needs to be exported because functions like tar_stan_compile_run() are part of the commands in the target objects, but I hide these functions from the user using @keywords internal.

sakrejda commented 3 years ago

When running devtools::test tests fail unless cmdstanr::install_cmdstan is run first, I think it would be helpful to check the cmdstan installation first. There is the cmdstanr::check_cmdstan_toolchain function that could be run in an initial test, with the fix=TRUE option. This is really a question: are you thinking you'd like to avoid doing this type of check at all because it's too fragile or do you think it would be a good idea to include it. I would lean towards including it since the functionality in the underlying key package is there and could be improved in cmdstanr if needed.

wlandau commented 3 years ago

Good point. keras handles a similar situation with skip_if_not_tensorflow_version (), and testthat::skip_if_not_installed() for Suggests: packages is a community-accepted pattern. Personally, I think this makes sense. Do you agree?

sakrejda commented 3 years ago

That's a great solution, also thanks for the prompt replies!

wlandau commented 3 years ago

Great. Should be fixed now.

sakrejda commented 3 years ago

Aside that's really for the base targets pacakge, but mentioning it here since it affects running the README.md example on one of my systems:

> targets::tar_visnetwork()
Error in viewer(index_html) : 
  'browser' must be a non-empty character string

From the error it's unclear that the answer is option(browser = "<path/to/broswer>"). It would be pretty straightforward to catch the lack of a default browser so that the message could specifically say "A default browser is required to view the network visualization, set a default browser with 'options(browser = "</path/to/browser>")`" or something similar. From the current message I wasn't sure if I need to pass a path to the function or set an option and if I was setting an option I wasn't sure what the option was called.

sakrejda commented 3 years ago

In the README.md I initially got an error b/c x.stan was not found so since I was running from the package root I changed it to inst/example.stan and that made tar_make() complete successfully. For a slightly intimidating package it would be helpful to have the default example in the README.md just work. I think you could retrieve the .stan file through a call to system.file so that the example runs without any changes. It would give a new user some confidence that they could figure the package out.

sakrejda commented 3 years ago

The README.md example seems to reference an internal data generating function for the example rather than the function included in the example (not sure I got that right but I think I did). Here's what I modified it to in order to have it working:

library(targets)
library(stantargets)

generate_data <- function(n = 10) {
  true_beta <- stats::rnorm(n = 1, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- stats::rnorm(n, x * true_beta, 1)
  list(n = n, x = x, y = y, true_beta = true_beta)
}

list(
  tar_stan_mcmc(test, "inst/example.stan", generate_data())
)

Also I wasn't sure exactly how stantargets relates to targets and it would be helpful in the README.md to have a short paragraph (maybe before the "prerequisites" section?) that states that the tar_stan_mcmc target defines sub-targets following a specific naming convention which allow you to schedule chains and access different components of the results. I don't want to nit-pick but taking the perspective of a naive user (I know Stan but I'm new-ish to targets itself) the auto-generation of a bunch of new targets is not obvious and I wasn't initially understanding that I could use tar_read on the auto-generated targets.

wlandau commented 3 years ago

The README.md example seems to reference an internal data generating function for the example rather than the function included in the example

Thanks, just updated the README to use generate_data(). (tar_stan_example_data() is exported, but generate_data() is still better since most users will write their own data functions.)

Also I wasn't sure exactly how stantargets relates to targets and it would be helpful in the README.md to have a short paragraph (maybe before the "prerequisites" section?) that states that the tar_stan_mcmc target defines sub-targets following a specific naming convention which allow you to schedule chains and access different components of the results.

I try to address these points early in the first vignette. I would prefer to keep the README at a high level in order to keep it digestible for new users and to avoid repeating documentation.

sakrejda commented 3 years ago

There's no way to pass additional arguments here: https://github.com/wlandau/stantargets/blob/522a94c3b0941aba10d49bfc756b2a6def1a5577/R/tar_stan_compile.R#L155

That means the pedantic and threads arguments currently can't be passed down. What do you think about adding a compile_dot_args to the tar_stan_compile function that would get passed down to the cmdstanr::cmdstan_model function?

sakrejda commented 3 years ago

I guess that's just a maintenance issue that will come up repeatedly with the way that arguments are currently structured (explicitly passed down and enumerated at each level). In some ways it's helpful, but it also means a user can end up stuck. What do you think about this issue? Is it an explicit tradeoff you're making?

wlandau commented 3 years ago

pedantic may have been created in cmdstanr after I last updated tar_stan_compile(), and threads is deprecated in favor of cpp_options. I intend to include all the formal arguments in the interface so users have a clearer idea how to use the functions. I think this is worth the maintenance cost. I will add pedantic.

wlandau commented 3 years ago

Now fixed.

sakrejda commented 3 years ago

Keeping the stantargets functions updated to the cmdstanr functions makes a lot of sense for generating good user documentation so it's hard to argue with that. I initially thought an automated tool could help (comparing formals, etc..) but the cmdstanr package is written with the opposite strategy where almost everything is passed down through ... so there's no easy way to implement that.

sakrejda commented 3 years ago

In many places I'm noticing a mixed usage of base R's substitute and rlang's tidy_eval and related functions. Sections 19.3.3 and 19.3.4 of Wickham's advanced R book compare substitute with a couple of rlang functions and note some differences in behavior.

  1. The mix of usage and behavior in stantargets may end up being confusing for the user (and other maintainers).
  2. There is a related set of utility functions in need of documentation:
  3. Those functions re-define some well-known names ('tidy_eval' for one) in a way that's not immediately obvious, is there a specific reason for that or could the names be changed?

I'm not sure what the best solution is. Ideally I would like to see a package like this stick to purely using rlang because that seems like the idiom that will stick around for a while, the behavior of those functions is consistent, and more users will ultimately be familiar with it. Additionally the in-between state seems like a significant maintenance burden.. There are lots of solutions (e.g.-just documenting the specific strategy used in the package and the utility functions) so I'd like to hear what your take is.

sakrejda commented 3 years ago

Minor comments:

Utility function without documentation, relies on file.exists which means assert_path will return without error even if the path is a directory, is that intentional?

The assert_stan_file function will return without error even if the path to the Stan file is a path to a directory, most users wouldn't create <blah.stan> directory but with automation it certainly happens. My preferred solution is to use fs::is_file which wraps both the existence and type check, but isFALSE(file.info('<path>')$isdi)r returns the same information.

The assert_package function uses requireNamespace which loads the package namespace as a side-effect... that's probably fine although using installed.packages would avoid that side-effect.

For utils/condition.R relying on rlang::abort rather than stop would make it possible to return the objects used in whatever function failed the assertion rather than just a text string. I realize actually making that switch might be a fairly extensive set of changes so just mentioning it.

Confusing code in utils/output.R since output is not defined as a parameter or locally

Stan fit objects can get very large, is there a specific reason for loading all the cached data into the object here? That approach can make it impossible to use the function due to RAM constraints.

wlandau commented 3 years ago

Re https://github.com/ropensci/software-review/issues/430#issuecomment-806590196,

  1. The mix of usage and behavior in stantargets may end up being confusing for the user (and other maintainers).

My strategy for rlang is to only choose components with significant advantages over base R for this specific set of use cases. I just refactored stantargets a little to make that clearer. As of just now, the only functions imported from rlang are:

I hesitate to depend on more rlang functions that already have equivalents in base R or stantargets because I have noticed changes in rlang over the years, and adjustment is a maintenance cost. Similar maintenance costs are already high because cmdstanr is developing rapidly.

There is a related set of utility functions in need of documentation.

What sort of documentation do you have in mind? I do not usually write roxygen2 docstrings for unexported functions, but is that what you are thinking?

Those functions re-define some well-known names ('tidy_eval' for one) in a way that's not immediately obvious, is there a specific reason for that or could the names be changed?

There exists an rlang::eval_tidy() but no rlang::tidy_eval(), so I thought this was safe. What other conflicts do you see?

wlandau commented 3 years ago

Utility function without documentation, relies on file.exists which means assert_path will return without error even if the path is a directory, is that intentional?

Yes, I would rather assert_path() be narrow in scope. Depending on where else these assertions become necessary, I may write an assert_files() or assert_not_dirs() function.

The assert_stan_file function will return without error even if the path to the Stan file is a path to a directory,

I just added a statement in assert_stan_file() to throw an informative error if stan_file is not a directory.

The assert_package function uses requireNamespace which loads the package namespace as a side-effect... that's probably fine although using installed.packages would avoid that side-effect.

I actually just switched to rlang::check_installed() because it has a nice prompt for the user to install missing packages in interactive mode. This function also uses requireNamespace(). I am not sure there is another way to efficiently check for an installed package. installed.packages() is pretty slow.

For utils/condition.R relying on rlang::abort rather than stop would make it possible to return the objects used in whatever function failed the assertion rather than just a text string. I realize actually making that switch might be a fairly extensive set of changes so just mentioning it.

Yeah, I would have definitely used rlang::abort() and rlang::warn() if I had known about them from the beginning.

Confusing code in utils/output.R since output is not defined as a parameter or locally

I thought I already defined output as a parameter here. Just changed the name to output_type to make the meaning clearer.

Stan fit objects can get very large, is there a specific reason for loading all the cached data into the object here? That approach can make it impossible to use the function due to RAM constraints.

It allows tar_stan_mcmc() to define downstream compressed targets for multiple different kinds of output (draws, diagnostics summaries). If we do not save the entire CmdStanMCMC object with all its data, the information vanishes by default. The alternative is to save the CSV files to persistent storage, tell targets to track those output files, and then bootstrap the CmdStanMCMC object using https://github.com/stan-dev/cmdstanr/pull/412 (which was not available when I first wrote tar_stan_mcmc()). But this requires designing a brand new data storage system outside the targets data store (_targets/), which would make projects a bit less portable and a bit more mysterious for users, and those CSV files are uncompressed and could take up a lot of space on disk. (Conversely, stantargets compresses CmdStanMCMC objects with qs and data frames with fst.) Users who hit memory constraints can currently run tar_stan_mcmc_rep_summary() with the default settings of batches = 1 and data = 1.

sakrejda commented 3 years ago

I noticed the changes you made in your usage of rlang and they're helpful. I'm hopeful rlang is more stable than it used to be but it makes sense to be cautious about it. I also like that you've been working with the posterior package rather than re-inventing the wheel on MCMC summaries.

I mis-read tidy_eval as eval_tidy a few times in a row before you pointed out they're different. Not everyone reliably recognizes word order (me included) so a name like tar_eval would make it clear this is a function your package defines rather than something imported.

I'll have some more comments but overall I think this package fills an important niche really well and you're addressing the challenges that are going to come up in terms of maintenance. The evolution of cmdstanr is probably going to keep going but I don't think you'll see huge breaking changes in terms of the arguments you end up passing down so the use of formal arguments seems like a toss-up in terms of costs and benefits.

wlandau commented 3 years ago

Thanks, @sakrejda.

a name like tar_eval would make it clear this is a function your package defines rather than something imported.

Now changed to tar_tidy_eval().

mattwarkentin commented 3 years ago

Package Review

@melvidoni and @wlandau, please see the attached review. This is my first time participating in software peer-review, so I've tried to follow the guidelines as much as possible. Some of my comments may or may not be out-of-scope for this kind of review, or too pedantic. I have reviewed this package primarily for following best practices for R package development, package usability, and for how it fits/works with targets, and not so much as a Bayesian, stan, or cmdstanr expert (which I believe is covered by @sakrejda). I hope this review is helpful. Thanks for including me in the process.

Documentation:

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 25


Review Comments

README

DESCRIPTION/NAMESPACE

Documentation

General

wlandau commented 3 years ago

Thank you for such a thorough review, @mattwarkentin! Even at a first glance, I can already tell this feedback is going to be super useful. I will work on a response soon.

sakrejda commented 3 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 20


Review Comments

I don't have any comments beyond those above, I'm looking forward to finding out how well the package does with models that have very large numbers of parameters but didn't have time to check that out for the review.

wlandau commented 3 years ago

Respnose to @mattwarkentin's review

The documentation of stantargets is so much better now, thanks again @mattwarkentin for your feedback.

goodpractice and covr indicate test coverage of 2.57%, though I believe this is incorrect based on my own inspection.

That's odd. stantargets does skip tests if:

Could that explain it?

I would consider adding a few sentences to the README to really explain that the primary set of functions in stantargets are target factories, and these functions use the information provided to construct a whole set of targets based on standard workflows/best practices for working with cmdstanr and targets, by abstracting away much of the technical details of both. The reason stantargets makes things easy is because it saves you from having to create and manage many tar_target()s manually. I think maybe this point isn't explicitly clear in the README and is important for developing a good mental model of this package. Maybe a simple visualization would be helpful, showing how a single call to a function like tar_stan_mcmc() not only creates many targets, but it creates the entire dependency structure (as shown via tar_visnetwork()).

Addressed in https://github.com/wlandau/stantargets/commit/e27b7552ce8dd3282950bd01090537136cf94ca4. The README now gives a brief explanation and refers readers to a vignette for an example.

Installing cmdstan might be a point of friction for some (though it wasn't for me). You may consider adding some additional text and a link to the README section on Installation for where users can find help if they are having installation issues for cmdstan/cmdstanr since I think it requires a toolchain that users may not already have setup (Example: https://mc-stan.org/cmdstanr/articles/cmdstanr.html)

Addressed in https://github.com/wlandau/stantargets/commit/016dd5bb4b5fc1fee9d7dfd686f6ddee97674b8e.

Related to the above comment, it might be helpful to very briefly explain the dependency tree that stantargets rests on. This may also prevent some confusions for new users about software names and where to find information if installation or other issues arise, since there is stan, cmdstan, cmdstanr, and then stantargets to wrap it all together for the purposes of using targets. Maybe this will be clearer to others, but I found this a little confusing, at least at first.

https://github.com/wlandau/stantargets/commit/712dcf274b45a87c005a10c46735f3eb70c6de5d mentions this in the context of troubleshooting, which is where I have seen confusion come up most often.

This is super picky, but, should it be made clear in the Usage that this is a simplified/toy example to understand the general stantargets workflow and not the expected usage. I don't expect many users would want to generate Stan data or define complex functions directly in their _targets.R script.

Clarified in https://github.com/wlandau/stantargets/commit/0df9acafbe720bc702a260913c9f8b09cf31e372.

It would be great if the Usage section could perhaps point towards some real examples, if they exist, similar to wlandau/targets-keras et al..

Good idea. I adapted https://github.com/wlandau/targets-stan and created a project that will eventually appear at https://github.com/wlandau/stantargets-example-validation. Unfortunately, it will not become public for another week or two because I need permission from my company to release it and the gatekeepers are on vacation.

It might be useful to show the full argument names in the Usage section for tar_stan_mcmc() since this is likely the entry-point to seeing this function used for the first time, and users are unlikely to know the argument order used for positional matching. For example: list( tar_stan_mcmc( name = example, stan_files = "x.stan", data = generate_data() ) )

Done in https://github.com/wlandau/stantargets/commit/0c14616c01aa642292f021db986a50f87abde860.

There is some duplication, and syntactic error, in the below text: tar_read(tar_read(example_summary_x) Run tar_visnetwork() to check _targets.R for correctness, then call tar_make() to run the pipeline. Access the results using tar_read(), e.g. tar_read(tar_read(example_summary_x). Visit this vignette to read more about this example.

Fixed in https://github.com/wlandau/stantargets/commit/f4265305ac05df53543438dc0c5c9883dabe9115.

Consider suppressing the warnings produced by citation("stantargets") to clean up this part of the README:

citation("stantargets")

Fixed in https://github.com/wlandau/stantargets/commit/708379d7e47be609780024960519120464b36e98.

Very minor comment: In the following part of the README.md file, you may consider time-stamping the YouTube video (this video) to direct users precisely to the point in the video where the relevant content starts (i.e. ~6min). The easier the better.

Added in https://github.com/wlandau/stantargets/commit/aa3ecbfd7e8d1d13746a15fa52303e08ccb44cde.

Do the specialized storage formats (e.g. fst, qs) need to be direct Imports for stantargets, or can they be Suggests and only installed optionally by those using these storage formats? I believe this is how things are handled in targets. I tend not to overthink dependencies, but this seems like the right time for scrutiny...Upon further code inspection, I see that fst_tbl and qs are used as default formats throughout stantargets, so maybe these need to stay as Imports.

Yes, stantargets functions automatically choose "fst_tbl" and "qs" so the user does not have to think about storage formats. Since these formats are always used, I think the corresponding packages should be hard dependencies.

You have indicated that you plan to submit stantargets to CRAN. Could you please reconcile how you plan to submit this package to CRAN while two of the Imports, most notably stan-dev/cmdstanr, are currently installed from a remote source?

The CRAN release of stantargets is going to be delayed for as long as it takes cmdstanr and posterior to reach CRAN. The Stan Development Team has expressed the desire to release both these packages eventually, but I do not know exactly when that will happen. But I do plan to publish stantargets as soon as its dependencies are ready.

Only a single function is imported from each of purrr (map()) and fs (path_rel()). Is it critical to Import both of these packages for functions that may have base equivalents that are suitable? This could avoid the requirement to install these two packages (as neither package are listed dependencies for targets)...Actually it looks like purrr::map2_dfr() and is also used but is not formally imported like purrr::map(). Following the above comment, most functions from external packages are imported in tar_stan_package.R, except is seems to be missing purrr::map2_dfr, rlang::syms, rlang::expr, and tarchetypes::tar_combine_raw. Please review.

I fixed some of this in @sakrejda's review. The rest should be fixed in https://github.com/wlandau/stantargets/commit/c83b1d7748a1873c36090ef3e34a288c6ec82b69. I feel stantargets operates at a high enough level where it makes sense to use fs and purrr sometimes, and that commit imports more functions that should have been imported (including fs::path_ext_remove() rather than tools::file_path_sans_ext(), which removes the explicit dependency on tools).

You have listed minimum versions of packages for nearly every dependency. Is this strict versioning necessary? Are these minimum versions based on known breaking changes in previous versions, or based on the authors current set of package versions? The latter is considered bad practice by the rOpenSci guidelines (https://devguide.ropensci.org/building.html#pkgdependencies)

That's a new one for me. And I suspect it may reflect a recent change in the R community's overall thinking. I definitely remember looking at DESCRIPTION files of major Tidyverse packages a few years ago and noticing that they set minimum versions for all the dependencies.

I see where this guideline is coming from, and I gave it serious thought. However, I am reluctant to drop the minimum version requirements of targets or any of its derivatives. Reasons:

  1. Years of bug reports for both drake and targets motivated many of these minimum version requirements. I am certain there were good reasons for rlang, digest, qs, and tibble, but I was not proactive about documenting them at the time, so all the current minimum version requirements are Chesterton's Fences.
  2. As you pointed out, stantargets has a lot of dependencies, which creates a lot of room for things to go wrong. targets frequently gets blamed for problems in other packages, so I end up spending a lot of time on tech support outside the scope of the packages I actually develop. Minimum version requirements help mitigate this.
  3. stantargets belongs in project-oriented workflows where users should be using renv anyway. That should help avoid modifications to the global package environment.

I did take another look at the minimum version requirements of stantargets, and in https://github.com/wlandau/stantargets/commit/b746ecfd695c7d8db14b2e22085a2e8f2bb8608a, I synchronized them with those of targets.

Please consider documenting the data-generating function tar_stan_example_data() using the approach recommended here (https://r-pkgs.org/data.html#documenting-data) for documenting standard package data.

Added a @format tag in https://github.com/wlandau/stantargets/commit/1e9f083a40d712199d35df4526d00e2b61f01443.

Consider adding @family roxygen tags to "families" of functions so they are linked in the documentation, making it easier for users to navigate between functions, perhaps in the same way as the functions are grouped in the stantargets pkgdown website.

Yeah, that's much more convenient than @seealso. Implemented in https://github.com/wlandau/stantargets/commit/1d103b07e45aa20528fcb9496683c920a4252975 to reflect the groupings at https://wlandau.github.io/stantargets/reference/index.html.

I think the @title, @description, and @return sections of some functions could use some modest revisions to improve function clarity. I'll address each in the following comments, using tar_stan_mcmc() (shown below), as an example:

' @title One MCMC per model with multiple outputs.

' @export

' @description Targets to run a Stan model once with MCMC

' and save multiple outputs.

Since functions are generally meant to represent actions, I suggest the @title should be in "future-tense" based on what action will be performed. Since the functions usually just create targets that themselves perform the actions, I think that can be ignored and the function titles can simply describe the actions achieved by the set of child targets, with short and evocative titles being preferred (some current titles may be well-suited as part of the description). The title should not contain any closing punctuation if they are not sentences.

' @title One MCMC per model with multiple outputs.

Might be revised to:

' @title Fit one MCMC per model with multiple outputs

Or perhaps something similar...

I removed terminating punctuation from the @titles as you suggested in https://github.com/wlandau/stantargets/commit/1d103b07e45aa20528fcb9496683c920a4252975.

To me, a title like "fit one MCMC per model with multiple outputs" implies that the function actually does the action instead of creating a target to do it later. I worry this may confuse new users of targets who often struggle to understand that calling tar_target(name, command()) does not actually run command().

Often, the @description for many functions is ostensibly just listing the return value of that function (usually the function is creating a set of targets), as a sentence fragment. I think it would be preferable to use the @description section to expound on what the function does using complete sentences/paragraph styling. Using tar_stan_mcmc(), for example, a naive attempt at a revision might look like:

' @description tar_stan_mcmc() creates a set of target objects that

' fit a single MCMC for each of the provided model files and

' saves multiple outputs containing results of the model fits.

I agree, edited in https://github.com/wlandau/stantargets/commit/a6de8786530c34cbc4e652cc22f786c64dd912fb.

Due to the nature of the stantargets functions (i.e. the functions abstract away technical details and behind-the-scenes create and coordinate multiple targets), I think the @return section is very important for users to have a solid mental model of what happens when these functions are invoked in a pipeline. I might suggest first describing the return value in a general sense before providing an example of a return, as is currently presented. Expanded on in the next two points.

https://github.com/wlandau/stantargets/commit/ec0d5e561abe85445550d19a76a4be26fe4e007f introduces a new @section on target objects. All target factories now inherit this section and refer to it at the beginning of @return. (See also https://github.com/wlandau/stantargets/commit/3822b96bb9cd05fa1add471aaf355a9df5b8e9c7).

For stantargets functions that return lists of targets, I like the bullet point list for each of the offspring targets, but the "descriptions" of the targets are a mix of what the target is used for (reproducibly track the Stan model file.) and what the target will produce (data for the generated quantities.). Consider harmonizing the descriptions or describing each, in turn. See below.

Following from the above two comments, a revised return section for tar_stan_mcmc() et al. might look something like this (I've used a markdown table for organization):

' @return

' tar_stan_mcmc() returns a list of target objects. The names of the

' targets are prefixed by the name argument and suffixed by the

' stan_files argument to create unique target names. A set of

' targets is produced for each model file provided to stan_files,

' except for the data target, which is shared between models.

'

' The table below provides an example of the names of the targets

' produced by tar_stan_gq(name = x, stan_files = "y.stan", ...)

' with a brief description of each target and what objects (if any)

' are produced when these targets are built.

'

' | Name | Description | Object |

' | :--- | :--- | :--- |

' | x_file_y | Track the Stan model file | None |

' | x_lines_y | Track contents of the Stan model file (omitted if compile = "original") | file |

' | x_data | Data for the MCMC | data.frame |

...and so on

https://github.com/wlandau/stantargets/commit/884f36f0113cbcd295d1e54e1b57e4ef9490bcfd describes the returned targets with more consistency and more detail. (See also https://github.com/wlandau/stantargets/commit/c688f1f1e551d1ab685cb86129e5be1797e2cf9f and https://github.com/wlandau/stantargets/commit/65672acaa751b73b50ee4a7524822e09bf35abbd.) Tables are a great idea, but rendered tables in Rd files seem to be borderless, and that may be an issue because of the large volume of text that would be in the table cells in this case.

Function arguments are defined in an inconsistent format within function documentation. Most commonly, they are defined as (type) Description, but also sometimes as Type, description and Type. Description. With so many arguments for most of the functions, I think it would be best to keep a strict and consistent format, for clarity purposes.

Unfortunately, I do not think I can fix this one. The arguments defined with "(type) Description" are inherited directly from the cmdstanr documentation (e.g. @inheritParams cmdstanr::cmdstan_model). The others are implemented directly in stantargets or inherited from targets. To rectify the inconsistency, I would have to change all the inherited arguments in targets, which will probably create other instances of the same inconsistency in other R Targetopia packages.

Again related to function arguments, and I'm not a cmdstan expert, but other than the first few required arguments, and the last few targets-specific arguments, are the "middle" arguments organized in a thematic way? If not, could they be, or even potentially sorted alphabetically, to make it easier to locate an argument? This is a very minor comment and may be based on my lack of familiarity with all of the moving parts of Stan and cmdstanr

The middle arguments are mostly ordered the same way as in the analogous functions in cmdstanr (e.g. tar_stan_mcmc() vs model-method-compile and model-method-sample). I feel this eases the transition from cmdstanr to stantargets. (But bome stantargets arguments are inserted next to their thematically appropriate counterparts, such as stdout and stderr, which are next to quiet.)

Generally speaking, in the signatures of target factories, I order arguments as follows:

  1. Required stantargets-specific arguments.
  2. cmdstanr arguments for compilation.
  3. cmdstanr arguments to run the Bayesian method.
  4. Optional stantargets-specific arguments.
  5. Optional targets-specific arguments.

The 5 groups above are in rough order of priority, and (2) goes before (3) because (3) has a massive number of arguments. Within each group, users of cmdstanr and targets are already accustomed to seeing those arguments next to each other.

There is some name repetition that I observed when running several of the examples. Very minor but could perhaps be updated to avoid any confusions. It happens because tempfile(), by default, names the initial part of the file "file". Using the example in tar_stan_mcmc() to demonstrate:

● start target your_model_file_file1b284ba3414 ● built target your_model_file_file1b284ba3414 ● start target your_model_data ● built target your_model_data ● start target your_model_mcmc_file1b284ba3414 ● built target your_model_mcmc_file1b284ba3414 ● start target your_model_draws_file1b284ba3414 ● built target your_model_draws_file1b284ba3414 ● start target your_model_summary_file1b284ba3414 ● built target your_model_summary_file1b284ba3414 ● start target your_model_diagnostics_file1b284ba3414 ● built target your_model_diagnostics_file1b284ba3414 ● end pipeline Using something like tempfile(pattern = "", fileext = ".stan") makes things cleaner:

● start target your_model_file_X192867a6863e ● built target your_model_file_X192867a6863e ● start target your_model_data ● built target your_model_data ● start target your_model_mcmc_X192867a6863e ● built target your_model_mcmc_X192867a6863e ● start target your_model_diagnostics_X192867a6863e ● built target your_model_diagnostics_X192867a6863e ● start target your_model_summary_X192867a6863e ● built target your_model_summary_X192867a6863e ● start target your_model_draws_X192867a6863e ● built target your_model_draws_X192867a6863e ● end pipeline

Sure, fixed in https://github.com/wlandau/stantargets/commit/1d1d4605d5c49efa62ae96f95b1ccbd0ae1cc1ca.

The example code in tar_stan_vb_rep_summary() (shown below) fails locally for me, producing the following error:

targets::tar_dir({ # tar_dir() runs code from a temporary directory. targets::tar_script({ library(stantargets)

' Do not use temporary storage for stan files in real projects

' or else your targets will always rerun.

path <- tempfile(fileext = ".stan") tar_stan_example_file(path = path) list( tar_stan_vb_rep_summary( your_model, stan_files = path, data = tar_stan_example_data(), batches = 2, reps = 2, stdout = R.utils::nullfile(), stderr = R.utils::nullfile ) ) }, ask = FALSE) targets::tar_make() }) ● start target your_model_batch ● built target your_model_batch ● start target your_model_file_file5cfc269d8f06 x error target your_model_file_file5cfc269d8f06 ● end pipeline Error : 'file' must be NULL or an already open connection Error: callr subprocess failed: 'file' must be NULL or an already open connection Visit https://books.ropensci.org/targets/debugging.html for debugging advice.

Thanks, fixed in https://github.com/wlandau/stantargets/commit/71716f7ac262bf7b6d0dda42aa18f6fa40ecadf5.

Should functions (e.g. tar_stan_mcmc(), etc.) fail if the file(s) passed to stan_files do not exist? This would flag issues when calling functions like tar_validate() to assess "correctness"", which seems like the right time to flag this issue, rather than at the time of trying to build the pipeline. But I am not a Stan expert, so maybe there is a reason to not jump the gun on throwing an error when a file doesn't exist. This is what happens if you tar_make() with a file that doesn't exist.

● start target name_data ● built target name_data ● start target name_file_moo ● end pipeline Error : missing files: model.stan Error: callr subprocess failed: missing files: model.stan Visit https://books.ropensci.org/targets/debugging.html for debugging advice.

I agree, stantargets should error early if the Stan model file does not exist. Fixed in https://github.com/wlandau/stantargets/commit/6f319483bb2dd99521a719703f2cbce79bc20c90 and https://github.com/wlandau/stantargets/commit/480e48a7b8d9e4d77a2835e907d603d8c35a0fa7.

tar_make() and tar_make_future() worked for me without issue in my interactive exploration of the package functionality, but tar_make_clustermq() (both the CRAN and GitHub versions) failed, throwing the following errors: Error in try(self$crew$finalize()) : attempt to apply non-function ● end pipeline Error in self$crew$receive_data() : Trying to receive data without workers Error: callr subprocess failed: Trying to receive data without workers Visit https://books.ropensci.org/targets/debugging.html for debugging advice.

If you have time, could we troubleshoot in a discussion thread? It would help if you would peel back some layers in a small reprex. Could just be a problem with targets or clustermq. Personally, I use stantargets with tar_make_clustermq() a lot at work without this error.

Super pedantic: I note that the trn() function is used in a lot of places, but at first glance it wasn't clear to me its role and it may be confusing for others trying to understand the code. I very lightly suggest considering a more evocative function name. Upon further inspection, I think it stands for "ternary", as in the ternary operator (similar to the one found in JavaScript and other languages). I believe the term ternary operator is a bit of a misnomer, as it only describes that an operator depends on three operands, but does not describe its actual purpose (conditional flow). Also, in this case, the function is not really an operator, just a standard R function.

I will keep this in mind and think about a new name. However, I hesitate to change it immediately because I use trn() in most of my other packages, and updating them all would take time.

Yes, "trn" is short for "ternary". I know the name itself only refers to the fact that it takes 3 arguments, but I still feel historical precedent makes it immediately recognizable. if_else() might also be appropriate, but it risks confusion with the problematically vectorized base::ifelse(). I find it difficult to think of another name as unique, evocative, and concise as "trn".

For completeness, I had internal dialog about the message queue that I ultimately decided was the concern of targets, so I posted a discussion in that repository, found here: ropensci/targets#391

Glad we iterated on that.

sakrejda commented 3 years ago

I will keep this in mind and think about a new name. However, I hesitate to change it immediately because I use trn() in most of my other packages, and updating them all would take time.

Maybe documenting it would help even though it's a utility? Here's what I did during my review:

> library(stantargets)
> ?trn
No documentation for β€˜trn’ in specified packages and libraries:
you could try β€˜??trn’
> trn
Error: object 'trn' not found
> stantargets::trn
Error: 'trn' is not an exported object from 'namespace:stantargets'
> stantargets:::trn
function (condition, x, y) 
{
    if (any(condition)) {
        x
    }
    else {
        y
    }
}
<bytecode: 0x55d183e9a9a8>
<environment: namespace:stantargets>
> 

Bit of a brainfart on my part but doc would help :shrug:

tjmahr commented 3 years ago

just observing the that one alternative name for trn(), based on the source, is ifany()

wlandau commented 3 years ago

Great name, @tjmahr! Changed in https://github.com/wlandau/stantargets/commit/bfa20986ca1ba10d972da067cc0dd6c2507d5670.

mattwarkentin commented 3 years ago

@wlandau Thanks for providing such detailed responses. A few remaining comments:

If you have time, could we troubleshoot in a discussion thread? It would help if you would peel back some layers in a small reprex. Could just be a problem with targets or clustermq. Personally, I use stantargets with tar_make_clustermq() a lot at work without this error.

Yes, I'm happy to try and troubleshoot this issue. I will try to put something together tomorrow when I'm back in the office. It may just be a me issue, but we can find that out.

@melvidoni I'm not sure if you need/want me to "sign off" on anything, but @wlandau has more than adequately addressed all of my comments. I am happy for this review to proceed forward to whatever is next.

wlandau commented 3 years ago

Thanks again for reviewing, @mattwarkentin! The spelling mistake is fixed in https://github.com/wlandau/stantargets/commit/296fcac37713b2e8cb6e8e6ed4379c8910c2c57c.

melvidoni commented 3 years ago

Hello @wlandau. I see you've made great work addressing all comments.

I see that @mattwarkentin has approved the package, what about @sakrejda? What do you think?

sakrejda commented 3 years ago

@melvidoni thanks for the reminder, I wasn't sure how to signal approval, is there something specific? @wlandau addressed all of my comments more than adequately.

melvidoni commented 3 years ago

Approved! Thanks @wlandau for submitting and @mattwarkentin @sakrejda for your reviews!

To-dos for @wlandau:

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent). More info on this here.

Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide, and tag @stefaniebutland in your reply. She will get in touch about timing and can answer any questions.

We've put together an online book with our best practice and tips, this chapter starts the 3d section that's about guidance for after onboarding. Please tell us what could be improved, the corresponding repo is here.

wlandau commented 3 years ago

Wonderful, @melvidoni! Thank you for serving as editor, and thanks again @mattwarkentin and @sakrejda for reviewing! I will work on these to-dos.

wlandau commented 3 years ago

@melvidoni, I believe I have implemented most of the items you identified. If you would add me as an admin of https://github.com/ropensci/stantargets, I can deactivate the current GitHub Pages setup and migrate to https://docs.ropensci.org.

melvidoni commented 3 years ago

Done, please check now!

wlandau commented 3 years ago

Thanks, @melvidoni. I deactivated GitHub Pages and removed the gh-pages branch. Does anything else need to be done for https://docs.ropensci.org/stantargets/ to appear?

melvidoni commented 3 years ago

Thanks, @melvidoni. I deactivated GitHub Pages and removed the gh-pages branch. Does anything else need to be done for https://docs.ropensci.org/stantargets/ to appear?

Hello, apologies for the delay in coming back to you. There was a wrong GitHub token somewhere so things were stalled. I spoke with the Editors and believe the website should appear today or tomorrow.

jeroen commented 3 years ago

See: https://dev.ropensci.org/job/stantargets/1/console

The docs didn't appear because the fails with:

-- Building articles -----------------------------------------------------------
Writing 'articles/index.html'
Reading 'vignettes/mcmc.Rmd'
-- RMarkdown error -------------------------------------------------------------
Quitting from lines 120-121 (mcmc.Rmd) 
Error : callr subprocess failed: CmdStan path has not been set yet. See ?set_cmdstan_path.
Visit https://books.ropensci.org/targets/debugging.html for debugging advice.
--------------------------------------------------------------------------------
Error : Failed to render RMarkdown

Is this something we should install on the docs server? Or maybe you could do something like this in the vignette:

if(identical(Sys.getenv("IN_PKGDOWN"), "true")){
  cmdstandr::install_cmdstan()
}