ices-tools-prod / icesTAF

Functions to support the ICES Transparent Assessment Framework
GNU General Public License v3.0

Installation from software.bib does not look at dependencies #31

Open calbertsen opened 2 years ago

calbertsen commented 2 years ago

When running, e.g., taf.bootstrap(), the procedure tries to install the R packages listed in the software.bib file, but it does not install missing dependencies and does not stop when an installation fails.

arni-magnusson commented 2 years ago

Hi Christoffer! That's a valid point.

The design of TAF is indeed that DATA.bib is a hard requirement - the only valid pathway to use data in a TAF analysis - while SOFTWARE.bib entries are not a hard requirement, especially for R packages.

In other words, a TAF analysis is perfectly valid if it uses R packages without declaring them in SOFTWARE.bib. The main purpose of declaring an R package in SOFTWARE.bib is to specify the exact version number (or SHA code) of a key package that is used in the analysis, typically a stock assessment package. This is important and useful information for scientific purposes and to strengthen reproducibility. Even with stronger reproducibility, we still cannot guarantee that it will be straightforward to rerun the analysis next year, or in 10 years, running R 7.0 on Windows 14. Some analyses are more reproducible than others, and we can usually tell by looking at the scripts: fewer dependencies mean better reproducibility.

For example, I was recently involved in a TAF analysis that uses the sraplus package, which has a large number of dependencies. From a fresh R install, you need to install nearly 200 packages just to get sraplus to work. It will probably be a challenge to rerun this analysis a few years from now. That's an extreme example, but in ICES assessments we can expect many analyses to start a TAF script with library(tidyverse), for example. The idea in TAF is not to have SOFTWARE.bib install every package used in the TAF analysis, along with all dependencies, but rather to selectively pinpoint the location and version of key software used.

In the case of SPiCT, for example, it could make sense to declare in SOFTWARE.bib not only the version of SPiCT used, but also the matching version of TMB and perhaps the Matrix package. This might be practical to support reproducibility, for example when rerunning an analysis that uses an old version of SPiCT on a computer where a newer version of SPiCT is installed but should not be used for that particular analysis.

Given the sraplus example above, and by extension other packages with several layers of dependencies, the taf.bootstrap() procedure should probably not attempt to install all missing dependencies. However, we're very interested in hearing about user experiences with SOFTWARE.bib entries for R packages. Again, these are not compulsory, but provided to (1) make life easier for users when different package versions are used, and (2) support reproducibility, to the extent that is practical.

Do you think, in the analysis you're working on and for your purposes, that it would be enough to declare in SOFTWARE.bib the version of the key software that is used, or do you think it would be useful to also declare the version of some key dependencies as well? We appreciate insights and suggestions from TAF users on this topic.

Based on experience and user feedback on SOFTWARE.bib entries, we can copy parts of this essay and add specific recommendations to the TAF documentation, e.g. https://github.com/ices-taf/doc/wiki/Bib-entries#software-version.

calbertsen commented 2 years ago

Hi Arni,

I completely agree that listing all 200 dependencies of the "tidyverse" in the SOFTWARE.bib is not useful to anyone! My expectation when running the script was that it would install the latest version of the dependencies for me, similar to what install.packages does for CRAN packages. Of course, that does not ensure reproducibility, but it is convenient for users re-running scripts.
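As a middle ground between installing every dependency and installing none, a script could at least report which direct dependencies are missing before it attempts an install. The following is only a minimal sketch, assuming the package source (with its DESCRIPTION file) is available locally. The helper missing_deps() is hypothetical and not part of icesTAF, and it looks only at the Depends and Imports fields, not at deeper layers of dependencies.

```r
# Hypothetical helper, not part of icesTAF: list the packages named in the
# Depends and Imports fields of a DESCRIPTION file that are not installed.
missing_deps <- function(description_file,
                         installed = rownames(utils::installed.packages())) {
  fields <- c("Depends", "Imports")
  dcf <- read.dcf(description_file, fields = fields)

  # Split the comma-separated fields into individual entries
  deps <- unlist(strsplit(dcf[!is.na(dcf)], ","))

  # Drop version constraints such as "(>= 1.0)" and surrounding whitespace
  deps <- trimws(sub("\\(.*\\)", "", deps))

  # "R" itself is a requirement on the interpreter, not a package
  deps <- setdiff(deps, c("R", ""))

  setdiff(deps, installed)
}
```

Calling missing_deps() on the DESCRIPTION file of a package returns a character vector of the direct dependencies that are not yet installed, which could be shown to the user or passed on to install.packages.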

A good alternative, in my opinion, would be to catch the error when a package cannot be installed automatically, stop the script, and give an informative error. Specifically, I was trying to re-run an assessment in TAF, but the script couldn't install FLCore because it was missing the iterators package. The script went on and couldn't install FLSAM because FLCore was missing. It didn't stop until it reached a point where some Rdata file did not exist. It would be easier to fix if the script stopped after the unsuccessful install.packages. Likewise, sourceAll continues with the model script even if the data script fails.
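The stop-on-failure behaviour described above could be sketched roughly as follows. This is an illustration under stated assumptions, not how taf.bootstrap() or sourceAll are implemented: install_or_stop() and run_scripts() are hypothetical names, and the installer and is_installed arguments exist only so the logic can be tried without actually installing anything. The sketch relies on the fact that install.packages typically reports a failed installation as a warning rather than an error, so the warnings are collected and the package is checked afterwards.

```r
# Hypothetical helper, not part of icesTAF: install one package and stop
# with an informative error if it is still unavailable afterwards.
install_or_stop <- function(pkg,
                            installer = utils::install.packages,
                            is_installed = function(p) {
                              requireNamespace(p, quietly = TRUE)
                            }) {
  if (is_installed(pkg)) return(invisible(TRUE))

  # Collect warnings raised during installation instead of losing them
  msgs <- character(0)
  withCallingHandlers(
    installer(pkg),
    warning = function(w) {
      msgs <<- c(msgs, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )

  if (!is_installed(pkg)) {
    stop("Installation of '", pkg, "' failed. ",
         paste(msgs, collapse = "; "), call. = FALSE)
  }
  invisible(TRUE)
}

# Hypothetical helper, not part of icesTAF: run scripts in order and stop
# at the first one that fails, naming the script in the error message.
run_scripts <- function(scripts) {
  for (script in scripts) {
    tryCatch(
      source(script, local = new.env()),
      error = function(e) {
        stop("Script '", basename(script), "' failed: ",
             conditionMessage(e), call. = FALSE)
      }
    )
  }
  invisible(TRUE)
}
```

With helpers along these lines, a failed FLCore installation would stop the run immediately with a message naming FLCore, instead of surfacing later as a missing Rdata file, and a failing data script would prevent the model script from running.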

Best, Christoffer