r-hub / cranlogs

Download Logs from the RStudio CRAN Mirror
https://r-hub.github.io/cranlogs/

Percentiles #16

Open pbreheny opened 8 years ago

pbreheny commented 8 years ago

Feature request: I'm not sure how much work would be involved in implementing this, but I think it would be very useful to have a function to return percentiles for downloads, in order to be able to say things like "package X is in the top 10% of downloaded packages from CRAN".

gaborcsardi commented 8 years ago

Good idea. I don't think it would be difficult to implement. Do you want to help with it? :)

A new SQL (plpgsql) procedure is needed here: https://github.com/metacran/cranlogs.app/blob/master/db/proc.sql

pbreheny commented 8 years ago

Hmm...well, I'm not sure I know enough SQL/JSON to be of much help. Algorithmically, it would seem to require:

  1. Get names of all CRAN packages
  2. Run cran_downloads on that list
  3. Calculate quantiles

Steps 2 and 3 are straightforward. Step 1 is clearly possible, but I wouldn't know how to do it through the SQL/JSON interface. Or perhaps there's a more efficient approach than all this?
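For what it's worth, step 3 could be sketched in R roughly as follows. This is only a sketch under assumptions: download_percentile() is a hypothetical helper (not part of cranlogs), and it expects the per-package totals to have been collected already. Steps 1 and 2 need network access, so they appear only as comments; utils::available.packages() lists just the packages currently on CRAN, and a query over every package may need to be sent in batches.

```r
# Hypothetical helper (not part of cranlogs): percentile rank of selected
# packages among all packages, given total download counts.
# `counts` is a named numeric vector: names are packages, values are downloads.
download_percentile <- function(counts, packages) {
  stopifnot(!is.null(names(counts)), all(packages %in% names(counts)))
  cdf <- stats::ecdf(unname(counts))       # empirical CDF over all packages
  pct <- 100 * cdf(unname(counts[packages]))  # % of packages with <= this count
  stats::setNames(pct, packages)
}

# Steps 1 and 2, shown as comments because they need network access:
# pkgs   <- rownames(utils::available.packages(repos = "https://cloud.r-project.org"))
# dl     <- cranlogs::cran_downloads(packages = pkgs, when = "last-month")
# totals <- tapply(dl$count, dl$package, sum)  # one total per package
# download_percentile(totals, "rlang")

# Toy data to show the shape of the result
toy <- c(a = 10, b = 20, c = 30, d = 40)
download_percentile(toy, c("a", "d"))
```

A package at or above the 90th percentile here would be "in the top 10%" in the sense of the original request.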

HenrikBengtsson commented 2 years ago

EDIT 2021-11-30: Answer to a different question below ... (I've updated it to say fraction instead of quantile)

Since you can get the total download count for all packages by passing packages = NULL ("... for a sum of downloads for all packages."), you could use that for your denominator. Here's the gist:

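# Per-day share of all CRAN downloads for each requested package.
cran_download_fraction <- function(packages, ...) {
  # Daily counts for the requested packages
  counts <- cranlogs::cran_downloads(packages = packages, ...)
  # Daily totals across all packages (packages = NULL); used as the denominator
  total <- cranlogs::cran_downloads(packages = NULL, ...)
  z <- lapply(total$date, FUN = function(.date) {
    x <- subset(counts, date == .date)
    y <- subset(total, date == .date)
    # y has one row per date, so y$count is recycled across the packages in x
    x$fraction <- x$count / y$count
    x[, c("date", "count", "fraction", "package")]
  })
  z <- do.call(rbind, z)
  rownames(z) <- NULL
  z
}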

Example:

pkgs <- c("rlang", "digest")
stats <- cran_download_fraction(pkgs, from = "2021-11-10", to = "2021-11-12")
stats
#>         date count    fraction package
#> 1 2021-11-10 86060 0.010044005   rlang
#> 2 2021-11-10 36999 0.004318129  digest
#> 3 2021-11-11 86956 0.011273038   rlang
#> 4 2021-11-11 36907 0.004784650  digest
#> 5 2021-11-12 78391 0.011641753   rlang
#> 6 2021-11-12 32555 0.004834704  digest

stats <- cran_download_fraction(pkgs, when = "last-week")
head(stats)
#>         date count    fraction package
#> 1 2021-11-17 87119 0.011624874   rlang
#> 2 2021-11-17 36247 0.004836681  digest
#> 3 2021-11-18 86853 0.012107869   rlang
#> 4 2021-11-18 37356 0.005207668  digest
#> 5 2021-11-19 72217 0.011277519   rlang
#> 6 2021-11-19 30428 0.004751684  digest

Suggestion

Add argument fraction = FALSE to cran_downloads() and make the above calculations internally.

Maybe fraction = TRUE could even be the default?

Limitation: The above only gives the download fraction per day. Anyone who wishes to calculate the download fraction over a longer time period, say, per week or per month, will have to do something else.
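One possible way around that limitation, as a rough sketch: sum the counts over the whole period before dividing. period_fraction() below is a hypothetical helper, not part of cranlogs. It assumes `counts` and `total` are data frames shaped like the cran_downloads() results used above (with `count` and, for `counts`, `package` columns) and that both cover the same dates.

```r
# Hypothetical helper (not part of cranlogs): share of all downloads per
# package over the whole period covered by `counts` and `total`.
period_fraction <- function(counts, total) {
  # Total downloads per package across all dates in the period
  agg <- stats::aggregate(count ~ package, data = counts, FUN = sum)
  # Denominator: all CRAN downloads over the same period
  agg$fraction <- agg$count / sum(total$count)
  agg
}

# Toy data standing in for the results of the two cran_downloads() calls
dates  <- as.Date(c("2021-11-10", "2021-11-11"))
counts <- data.frame(
  date    = rep(dates, each = 2),
  count   = c(10, 5, 30, 15),
  package = rep(c("rlang", "digest"), times = 2)
)
total <- data.frame(date = dates, count = c(100, 100))

period_fraction(counts, total)
```

With real data, the two inputs would come from calls such as cranlogs::cran_downloads(packages = pkgs, when = "last-month") and cranlogs::cran_downloads(packages = NULL, when = "last-month").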

pbreheny commented 2 years ago

Well, this isn't really returning quantiles (or at least, not what I had in mind). rlang might represent 1.2% of all downloads on 2021-11-17, but I would assume that places it in the 99th percentile among all CRAN packages.

HenrikBengtsson commented 2 years ago

Doh! Fair point. I have no idea what I was thinking. I've updated my comment to say 'fraction' instead of 'quantile'.