pharmaR / riskmetric

Metrics to evaluate the risk of R packages
https://pharmar.github.io/riskmetric/

Number of downloads #4

Closed: elong0527 closed this issue 4 years ago

elong0527 commented 5 years ago

We need to define a number-of-downloads metric for each R package.

  1. How to define the metric?
    • Suggested option: average number of monthly downloads from RStudio/Bioconductor over the past 12 months? (using the dlstats R package; see the sketch below this list)
    • Another option per @dgkf is BiocPkgTools::biocDownloadStats()
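For the first option, a minimal sketch of what such a metric could look like, using the cranlogs package (which queries the same API that dlstats uses). The helper name `avg_monthly_downloads` and the fixed 12-month window are illustrative assumptions, not existing riskmetric code:

```r
# Illustrative sketch only: average monthly CRAN downloads over the past
# ~12 months, queried via the cranlogs package.
library(cranlogs)

# hypothetical helper, not part of riskmetric
avg_monthly_downloads <- function(pkg, months = 12) {
  to   <- Sys.Date()
  from <- to - months * 30              # approximate 30-day months
  dl   <- cran_downloads(packages = pkg, from = as.character(from), to = as.character(to))
  sum(dl$count) / months                # mean downloads per month
}

avg_monthly_downloads("riskmetric")
```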

Discussion point:

Should we aim to develop this R package itself in a low-risk category? (That means we should be careful about which R packages we use to derive the metrics.) This is not a high priority at the current stage.

matthiazzz commented 5 years ago

Find below a simple script to scrape arbitrary download stats from cranlogs for a CRAN package.

The only dependency is jsonlite, which could also be removed, since it just transforms the JSON response into a data frame.

As already mentioned, there is a package that does all of this already (https://cranlogs.r-pkg.org/#rpackage); a shorter equivalent using it is shown after the script below.

```r
# Which package, CRAN only
pack_name <- "DoseFinding"

require(jsonlite, quietly = TRUE)  # to parse the cranlogs JSON response via fromJSON()

# get the last 6 months of downloads from the RStudio cranlogs
today <- Sys.Date()   # today's date
duration <- 180       # 180 days, roughly 6 months
url4 <- paste("https://cranlogs.r-pkg.org/downloads/total/",
              today - duration, ":", today, "/", pack_name, sep = "")
input_file4 <- tempfile()
last_month_exists <- 1  # init value; flipped by the handlers below on failure
last_month <- tryCatch(
  {download.file(url4, input_file4)},
  warning = function(w) last_month_exists <<- 99,
  error   = function(e) { cat("error"); last_month_exists <<- 9 }
)
if (last_month_exists == 1) {
  last_month_data <- format(fromJSON(input_file4), big.mark = ",")
  last_month_text <- paste("The package has been downloaded",
                           last_month_data["downloads"],
                           "times in the last", duration, "days (between",
                           last_month_data["start"], "and",
                           last_month_data["end"], ").")
} else {
  last_month_text <- "No detailed download data is available for last month."
}

# get total downloads from the RStudio cranlogs
from_date <- "2012-10-01"  # cranlogs started in October 2012
to_date <- Sys.Date()      # current date
url5 <- paste("https://cranlogs.r-pkg.org/downloads/total/",
              from_date, ":", to_date, "/", pack_name, sep = "")
input_file5 <- tempfile()
total_data_exists <- 1  # init value
total_data <- tryCatch(
  {download.file(url5, input_file5)},
  warning = function(w) total_data_exists <<- 99,
  error   = function(e) { cat("error"); total_data_exists <<- 9 }
)
if (total_data_exists == 1) {
  total_data <- format(fromJSON(input_file5), big.mark = ",")
  total_data_text <- paste("The package has been downloaded a total of",
                           total_data["downloads"],
                           "times from the RStudio servers since the beginning",
                           "of cranlogs in October 2012.")
} else {
  total_data_text <- "No total download data is available."
}

paste(last_month_text, total_data_text)
```
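
For comparison (my addition, not part of the script above), the cranlogs package wraps the same API, so the two queries reduce to roughly:

```r
# Sketch of the same queries via the cranlogs package; the result is a
# data frame of daily counts rather than the formatted text above.
library(cranlogs)

last_6_months <- cran_downloads("DoseFinding",
                                from = as.character(Sys.Date() - 180),
                                to   = as.character(Sys.Date()))
sum(last_6_months$count)

all_time <- cran_downloads("DoseFinding",
                           from = "2012-10-01",
                           to   = as.character(Sys.Date()))
sum(all_time$count)
```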

matthiazzz commented 5 years ago

For Bioconductor, one can use the above-mentioned function biocDownloadStats() (a sketch of that call is included after the example below).

However, all it does is read in the following table: http://bioconductor.org/packages/stats/bioc/bioc_pkg_stats.tab, as one can see e.g. here: https://rdrr.io/bioc/BiocPkgTools/src/R/biocDownloadStats.R

Consequently, the stats can be derived quite simply:

```r
# for Bioconductor
require(dplyr)

# takes a few seconds to finish
bioc_downloadstats <- read.table("http://bioconductor.org/packages/stats/bioc/bioc_pkg_stats.tab",
                                 sep = "\t", header = TRUE)

# derive overall download summaries for all packages
total_stats <- bioc_downloadstats %>%
  group_by(Package) %>%
  summarise("Total Downloads" = sum(Nb_of_downloads))

# for a specific package
bio_pack <- "limma"

# total downloads
subset(bioc_downloadstats, Package == bio_pack, select = Nb_of_downloads) %>% sum()

# downloads in 2018
subset(bioc_downloadstats, Package == bio_pack & Year == 2018, select = Nb_of_downloads) %>% sum()
```
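
For reference, a rough sketch of the same number via BiocPkgTools itself; I'm assuming the default arguments are sufficient and that the returned columns mirror the bioc_pkg_stats.tab file read above:

```r
# Sketch using BiocPkgTools; biocDownloadStats() reads the same stats table.
library(BiocPkgTools)
library(dplyr)

bioc_stats <- biocDownloadStats()

bioc_stats %>%
  filter(Package == "limma") %>%
  summarise(total_downloads = sum(Nb_of_downloads))
```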

dgkf commented 5 years ago

Thanks @matthiazzz, this is very helpful. It seems like we can reduce our dependency footprint by querying the data from the source files directly.

For the time being, I think it might be in our interest not to worry too much about dependencies. It's probably easiest to focus on the simplest viable mechanisms for now so we can experiment quickly, and using existing package solutions should keep our code logic more immediately obvious as we get the project off the ground. Inevitably some of the dependencies will be a bit heavy for the functionality we need, and we can migrate to lighter-weight solutions later.

I'm open to alternative approaches, and the decision of when to move from 'experimental' to 'production' is always a tricky one to self-identify - how do others feel? Are there other considerations that might drive us to avoid these dependencies?


A small side note: you can use the syntax below for multi-line R code blocks in GitHub-flavored markdown.

```r
<your code>
```

elong0527 commented 5 years ago

@matthiazzz, I have given you write permission for this repo. Would you like to enhance the R function based on your summary here and create a pull request?

elong0527 commented 4 years ago

Will create another issue for Bioconductor downloads.