tdhock / data.table-revdeps


too many downloads from bioconductor? #20

Closed tdhock closed 8 months ago

tdhock commented 8 months ago

https://github.com/tdhock/data.table-revdeps/commit/72c821e8990988eb0a52d9b9426b30768f6617aa is an attempt to fix this issue; I explain the fix below.

I actually do have a mechanism in place to avoid repetitive downloads: https://github.com/tdhock/data.table-revdeps/blob/master/popular_deps.csv is a file with the names of all the frequently used packages, which are downloaded only once per day (rather than once in every one of the 1400+ tasks in my job array). I have added all of the necessary bioconductor packages to this list, so hopefully that helps.

There is a tradeoff between installing packages once (which makes the single setup job take more time) versus installing them in each task (which reduces the time of the initial setup job, but increases the time for each task). I monitor the number of packages which are installed in each of these tasks, and the current max number of tasks with the same install is 79, for a CRAN package (not from bioconductor). You can see that in one of these reports, https://rcdata.nau.edu/genomic-ml/data.table-revdeps/analyze/ in the section "Most installed packages".

I checked how many of these packages are from bioconductor: there are 160 packages, downloaded a total of 1324 times per day, which would mean about 40k downloads per month. That seems to be inconsistent with the claimed 1M+ downloads since Jan 1st.

I'm not sure I understand "time trials of installs using BiocManager", could you please clarify? I do monitor how long the package installation takes overall in each of my jobs (some packages in that installation come from Bioc, but most from CRAN). I'm not sure I fully understand the issue, so I was wondering if you could share the contact info of the bioconductor person?
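The download estimate above can be checked with a few lines of arithmetic. This is only a sketch: the 1324 downloads per day and the 1M+ claim come from the comment, while the 30-day month is an assumed round figure.

```python
# Figures quoted in the comment above.
BIOC_DOWNLOADS_PER_DAY = 1324
CLAIMED_DOWNLOADS = 1_000_000  # lower bound of the "1M+" claim

# Assumption: a round 30-day month.
DAYS_PER_MONTH = 30

# Estimated monthly downloads at the observed daily rate.
monthly_downloads = BIOC_DOWNLOADS_PER_DAY * DAYS_PER_MONTH

# Number of days the observed rate would need to reach the claimed total.
days_to_reach_claim = CLAIMED_DOWNLOADS / BIOC_DOWNLOADS_PER_DAY

print(f"downloads per month: {monthly_downloads}")
print(f"days needed to reach 1M: {days_to_reach_claim:.0f}")
```

At 1324 downloads per day, the monthly total is 39720 (about 40k), and reaching 1M would take roughly 755 days, which is more than two years and so cannot fit in the period since Jan 1st.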

tdhock commented 8 months ago

Adding all of the bioc pkgs to popular_deps resulted in the params.sh job (re-building R-devel and re-installing all popular deps) going over its time limit of 10 hours. Therefore, if we need to limit bioc bandwidth, it is no longer feasible to do a full re-build and re-install every day. The new solution is to re-build R-devel, but save and restore the popular deps in the library: https://github.com/tdhock/data.table-revdeps/commit/1533c70db54a3934932fa24ad8e8d1ee9e7d65c3
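The save-and-restore idea could look roughly like the sketch below. It is not the code from the linked commit: the function names `save_library` and `restore_library` and the directory layout are hypothetical, and the sketch assumes the library is a plain directory with one subdirectory per installed package.

```python
import shutil
import tempfile
from pathlib import Path


def save_library(library_dir: Path, backup_dir: Path) -> None:
    """Copy the installed-package library to a backup location (hypothetical helper)."""
    if backup_dir.exists():
        # Discard any stale backup so the copy reflects the current library.
        shutil.rmtree(backup_dir)
    shutil.copytree(library_dir, backup_dir)


def restore_library(backup_dir: Path, library_dir: Path) -> None:
    """Copy backed-up packages into the (possibly freshly rebuilt) library."""
    # dirs_exist_ok=True merges into an existing library directory (Python 3.8+).
    shutil.copytree(backup_dir, library_dir, dirs_exist_ok=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        library = Path(tmp) / "library"  # hypothetical library path
        backup = Path(tmp) / "backup"    # hypothetical backup path
        (library / "somePkg").mkdir(parents=True)  # stand-in for an installed package

        save_library(library, backup)
        shutil.rmtree(library)  # stand-in for the rebuild wiping the library
        library.mkdir()
        restore_library(backup, library)

        print(sorted(p.name for p in library.iterdir()))
```

The point of the design is that the restore step is a local copy, so the daily job does not need to download the popular deps again after each rebuild.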