colinpmillar commented 4 years ago

It would be really useful to mark a dataset as not to be updated during development. Some datasets do not change much and take a long time to download, so during development, if these could be marked as 'cache=TRUE' or similar, then taf.bootstrap() would by default not clean these folders and not re download them.

colinpmillar commented 4 years ago

Been thinking about this a bit more, and we might have an argument in

taf.bootstrap(force = TRUE)

and set the default to clean=FALSE

behaviour:

clean = TRUE - delete all data folders and start from scratch clean = FALSE, force = FALSE - only run entries that are new clean = FALSE, force = TRUE - rerun all scripts but do not delete existing data

The behaviour of is to allow the datasets to be retained (i.e. clean = FALSE) this means when new datasets are added we do not loose the old ones.

Also , it then becomes good practice to write 're-entrant' bootstrap data scripts, that will skip downloading of files that are already there - e.g.. when you are developing a bootsrap script that dowlonads of lots of DATRAS files..

There is some logic to work out here, but it is pretty crippling when a dataset that takes 1 hour or more to download is delete by mistake when you forget to put clean=FALSE so I think its worth looking into at some point

arni-magnusson commented 4 years ago

The current behavior of taf.bootstrap(data=FALSE) is to allow the datasets to be retained.

That's quite useful and intuitive, but I agree, we can think of more detailed control in the future.

colinpmillar commented 4 years ago

But does it allow new entries to be added? Thats the crucial point here. If there is a large dataset that has already been downloaded, how does a user add a new dataset without either, commenting out the other dataset and setting clean = FALSE, or redownloading the large dataset over and over.

arni-magnusson commented 4 years ago

Good points. With clean.data() and commits up to c75399b, we now have a bootstrap procedure that is considerably smoother than in the past. The default taf.bootstrap() no longer download files if they are already in place.

On the other hand taf.bootstrap(taf=TRUE) downloads and installs everything.

This arrangement makes things work fast on user computers, but puts a considerable load on the TAF server. There are simply so many things that can go wrong if the TAF server does not download and install every time, such as changes in remote data sources and version-specific package dependencies. If we want to guarantee healthy runs on the TAF server, we can probably use taf.bootstrap(taf=TRUE) and increase the number of available cores as necessary.

To select data files that should be fetched again, users just delete files inside bootstrap/data in their file manager. That is probably the most familiar and intuitive interface to specify which files to delete and fetch again, although people can also use file.remove if they prefer.

So icesTAF 3.5-0 is ready for a CRAN release, as far as I can tell. Tests, feedback, and ideas are welcome :)

arni-magnusson commented 4 years ago

This is worth thinking about before we release 3.5-0.

If we have the server run taf.bootstrap(taf=TRUE) every time, it might double the server load compared to running the default taf.bootstrap(). For some analyses, the package installation can take 10x longer than running the scripts, while for others the package installation takes less time than the scripts, and for many analyses there are no packages to install.

In the current development version of icesTAF, the taf=TRUE flag sets force=TRUE which has three effects:

All GitHub resources (data and software) are downloaded and packages are installed.
All URLs (data and software) are downloaded.
All bootstrap scripts (data and software) are run.

The default taf.bootstrap(force=FALSE) is to skip cases 1-3 if it looks like the files are already in place. The 4th case (copying files from bootstrap/initial/data to bootstrap/data) is always enabled, since it comes with no cost. If you edit a data file inside bootstrap/initial/data, it's always good to get the newest version into bootstrap/data.

Case 1 is probably an overkill 99% of the time, but we have seen some edge cases. Someone declared a SOFTWARE.bib entry such as casperwberg/surveyIndex@ref, forgetting to include the subdirectory casperwberg/surveyIndex/surveyIndex@ref. Even after correcting the SOFTWARE.bib entry, a force=TRUE was required to have download.github() overwrite the old surveyIndex_13579bd.tar.gz file. Another edge case is if multifleet FLSAM is first built with the wrong FLCore or stockassessment package dependency and should be rebuilt with force=TRUE.

Cases 2-3 might be an overkill 95% of the time, but bootstrap scripts may have been modified and online remote resources may have changed.

A fast and somewhat slack policy would be to have the TAF server run with force=FALSE by default, but allow the user to enable force=TRUE via a commit message. Given that case 1 is more likely to be an overkill than cases 2-3, we could distinguish force.github and force.other although that may seem like too many levers and buttons.

So among the options mentioned above are (a) "fast and slack" allowing users to set force=TRUE via a commit message, (b) "many buttons" allowing users to set force.github=TRUE and force.other=TRUE separately, or (c) the current "safe and slow" approach that guarantees healthy TAF server runs. With longer run times, option C is related to another design decision, whether launching a new run on the TAF server should cancel previous runs of the same analysis.

There's probably no big rush to release 3.5-0 to CRAN yet, so we can test and brainstorm a bit :)

arni-magnusson commented 4 years ago

Excellent: with c75399b, datasets are cached by default. To overwrite existing datasets, run taf.bootstrap(taf=TRUE).

ices-tools-prod / icesTAF

Option to 'cache' a dataset #12

behaviour: