Closed colinpmillar closed 4 years ago
Been thinking about this a bit more: we might add a force argument, as in taf.bootstrap(force = TRUE), and set the default to clean = FALSE. The combinations would behave as:
clean = TRUE
- delete all data folders and start from scratch
clean = FALSE, force = FALSE
- only run entries that are new
clean = FALSE, force = TRUE
- rerun all scripts but do not delete existing data
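A minimal sketch of how that per-entry logic might look (process.entry and the directory layout are assumptions for illustration, not the actual icesTAF implementation):

```r
# Hypothetical sketch of the proposed clean/force behaviour
bootstrap.data <- function(entries, clean = FALSE, force = FALSE) {
  if (clean)
    unlink("bootstrap/data", recursive = TRUE)  # start from scratch
  for (entry in entries) {
    already.done <- dir.exists(file.path("bootstrap", "data", entry))
    if (clean || force || !already.done)
      process.entry(entry)  # run the entry's script or download
  }
}
```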
The idea is to allow the datasets to be retained (i.e. clean = FALSE); this means that when new datasets are added we do not lose the old ones.
Also, it then becomes good practice to write 're-entrant' bootstrap data scripts that skip downloading files that are already there - e.g. when you are developing a bootstrap script that downloads lots of DATRAS files.
There is some logic to work out here, but it is pretty crippling when a dataset that takes an hour or more to download is deleted by mistake because you forgot to set clean = FALSE, so I think it's worth looking into at some point.
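For example, a re-entrant DATRAS download loop might look like this (the survey, years, and file names are illustrative assumptions; getDATRAS() is from the icesDatras package):

```r
library(icesDatras)

# Re-entrant download: files already on disk are skipped, so rerunning
# the script only fetches what is missing
for (yr in 2010:2020) {
  outfile <- paste0("hh_nsibts_", yr, ".csv")
  if (file.exists(outfile))
    next  # already downloaded - skip
  hh <- getDATRAS(record = "HH", survey = "NS-IBTS",
                  years = yr, quarters = 1)
  write.csv(hh, outfile, row.names = FALSE)
}
```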
The current behavior of taf.bootstrap(data=FALSE) is to allow the datasets to be retained.
That's quite useful and intuitive, but I agree, we can think of more detailed control in the future.
But does it allow new entries to be added? That's the crucial point here. If there is a large dataset that has already been downloaded, how does a user add a new dataset without either commenting out the other dataset and setting clean = FALSE, or redownloading the large dataset over and over?
Good points. With clean.data() and commits up to c75399b, we now have a bootstrap procedure that is considerably smoother than in the past. The default taf.bootstrap() no longer downloads files if they are already in place.
On the other hand, taf.bootstrap(taf=TRUE) downloads and installs everything.
This arrangement makes things work fast on user computers, but puts a considerable load on the TAF server. There are simply so many things that can go wrong if the TAF server does not download and install every time, such as changes in remote data sources and version-specific package dependencies. If we want to guarantee healthy runs on the TAF server, we can probably use taf.bootstrap(taf=TRUE) and increase the number of available cores as necessary.
To select data files that should be fetched again, users can simply delete files inside bootstrap/data in their file manager. That is probably the most familiar and intuitive interface to specify which files to delete and fetch again, although people can also use file.remove() if they prefer.
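For example (the paths below are illustrative):

```r
# Remove one cached file so the next taf.bootstrap() fetches it again
file.remove("bootstrap/data/somefolder/somefile.csv")

# Or remove a whole cached data folder
unlink("bootstrap/data/somefolder", recursive = TRUE)
```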
So icesTAF 3.5-0 is ready for a CRAN release, as far as I can tell. Tests, feedback, and ideas are welcome :)
This is worth thinking about before we release 3.5-0.
If we have the server run taf.bootstrap(taf=TRUE) every time, it might double the server load compared to running the default taf.bootstrap(). For some analyses, the package installation can take 10x longer than running the scripts, while for others the package installation takes less time than the scripts, and for many analyses there are no packages to install.
In the current development version of icesTAF, the taf=TRUE flag sets force=TRUE, which has three effects (cases 1-3, discussed below).
The default taf.bootstrap(force=FALSE) skips cases 1-3 if it looks like the files are already in place. The 4th case (copying files from bootstrap/initial/data to bootstrap/data) is always enabled, since it comes at no cost. If you edit a data file inside bootstrap/initial/data, it's always good to get the newest version into bootstrap/data.
Case 1 is probably overkill 99% of the time, but we have seen some edge cases. Someone declared a SOFTWARE.bib entry such as casperwberg/surveyIndex@ref, forgetting to include the subdirectory: casperwberg/surveyIndex/surveyIndex@ref. Even after correcting the SOFTWARE.bib entry, force=TRUE was required to have download.github() overwrite the old surveyIndex_13579bd.tar.gz file. Another edge case is if multifleet FLSAM is first built with the wrong FLCore or stockassessment package dependency and should be rebuilt with force=TRUE.
Cases 2-3 might be overkill 95% of the time, but bootstrap scripts may have been modified and remote online resources may have changed.
A fast and somewhat slack policy would be to have the TAF server run with force=FALSE by default, but allow the user to enable force=TRUE via a commit message. Given that case 1 is more likely to be overkill than cases 2-3, we could distinguish force.github and force.other, although that may seem like too many levers and buttons.
So among the options mentioned above are (a) "fast and slack", allowing users to set force=TRUE via a commit message; (b) "many buttons", allowing users to set force.github=TRUE and force.other=TRUE separately; or (c) the current "safe and slow" approach that guarantees healthy TAF server runs. With longer run times, option (c) is related to another design decision: whether launching a new run on the TAF server should cancel previous runs of the same analysis.
There's probably no big rush to release 3.5-0 to CRAN yet, so we can test and brainstorm a bit :)
Excellent: with c75399b, datasets are cached by default. To overwrite existing datasets, run taf.bootstrap(taf=TRUE).
It would be really useful to be able to mark a dataset as not to be updated during development. Some datasets do not change much and take a long time to download, so if these could be marked as cache=TRUE or similar, then taf.bootstrap() would by default not clean these folders and not re-download them.