r-hub / cranlogs

Download Logs from the RStudio CRAN Mirror
https://r-hub.github.io/cranlogs/

cranlogs::cran_downloads() overcounts downloads on 8 days at end of 2012 and beginning of 2013 #64

Open lindbrook opened 2 years ago

lindbrook commented 2 years ago

This is for posterity's sake but I hope it'll be fixed.

For eight days at the end of 2012 and the beginning of 2013, cranlogs::cran_downloads() returns counts that are double or even triple what they should be. I'm fairly confident of this conclusion because my numbers come from downloading the logs directly from RStudio and counting the log entries.

The code for my analysis:

library(cranlogs)
library(packageRank)

start.date <- "2012-10-01"
end.date <- "2013-01-05"

# The expression below uses 'cranlogs' to compute the total number of
# downloads for all of CRAN on the dates above:

cranlogs.data <- cranlogs::cran_downloads(from = start.date, to = end.date)

# The code below uses 'packageRank' and the "raw" RStudio logs to compute
# the total number of downloads for all CRAN packages on the dates above.

# There are two functions to note: fixDate_2012(), which is part of
# 'packageRank' but is not exported (not in namespace) and
# packageRank::fetchCranLog().

# fixDate_2012() corrects for mislabelled filenames/URLs and duplicate logs

fixDate_2012 <- function(date = "2012-12-31") {
  # inherits() is safe for objects with more than one class (e.g., POSIXct)
  if (!inherits(date, "Date")) ymd <- as.Date(date)
  else ymd <- date
  if (format(ymd, "%Y") == "2012") {
    if (ymd %in% as.Date(c("2012-12-29", "2012-12-30", "2012-12-31"))) {
      stop("Log for ", ymd, " is missing/unavailable.", call. = FALSE)
    } else if (ymd >= as.Date("2012-10-13") && ymd <= as.Date("2012-12-28")) {
      ymd <- ymd + 3
    } else if (ymd %in% as.Date(c("2012-10-11", "2012-10-12"))) {
      if (identical(ymd, as.Date("2012-10-11"))) {
        ymd <- as.Date("2012-10-12")
      } else if (identical(ymd, as.Date("2012-10-12"))) {
        ymd <- as.Date("2012-10-14")
      }
    }
  }
  ymd
}

# packageRank::fetchCranLog(date, memoization = FALSE)
# retrieves logs by their "literal" or exact filename/URL

d <- seq(from = as.Date(start.date), to = as.Date(end.date), by = "day")

packageRank.data <- vapply(d, function(x) {
  tmp <- try(packageRank::fetchCranLog(fixDate_2012(x), memoization = TRUE), silent = TRUE)
  # A failed fetch (e.g., a missing log) is counted as zero downloads
  if (inherits(tmp, "try-error")) 0L
  else nrow(tmp[!is.na(tmp$package), ])
}, integer(1L))

packageRank.data <- data.frame(date = d, count = packageRank.data)

# Merge the two data frames by calendar date:
cran.data <- merge(cranlogs.data, packageRank.data, by = "date")
names(cran.data)[-1] <- c("cranlogs", "packageRank")

# Compute the ratio of counts of 'cranlogs' to 'packageRank'
cran.data$ratio <- cran.data$cranlogs / cran.data$packageRank

# If you take a look at `cran.data`, you'll see that generally,
# you get the same exact results for both methods except for
# 8 discrepancies or errors:

errors <- cran.data[cran.data$cranlogs != cran.data$packageRank, ]

# > errors
#          date cranlogs packageRank    ratio
# 6  2012-10-06    13630        6815 2.000000
# 7  2012-10-07       50          25 2.000000
# 8  2012-10-08      170          85 2.000000
# 11 2012-10-11      388         194 2.000000
# 87 2012-12-26    80738       26910 3.000297
# 88 2012-12-27    49007       24501 2.000204
# 89 2012-12-28    21959       10979 2.000091
# 93 2013-01-01    21822       10911 2.000000

The ratios of the two counts are whole numbers, or very nearly so. This leads me to believe that there may be computational errors in 'cranlogs'.
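To make the "whole number" observation concrete, here is a minimal sketch that re-enters the eight rows of the table above by hand and computes the nearest integer multiple along with whatever is left over. The column names `multiple` and `remainder` are introduced here for illustration only.

```r
# The eight discrepant rows, copied from the `errors` table above.
errors <- data.frame(
  date = as.Date(c("2012-10-06", "2012-10-07", "2012-10-08", "2012-10-11",
                   "2012-12-26", "2012-12-27", "2012-12-28", "2013-01-01")),
  cranlogs = c(13630L, 50L, 170L, 388L, 80738L, 49007L, 21959L, 21822L),
  packageRank = c(6815L, 25L, 85L, 194L, 26910L, 24501L, 10979L, 10911L)
)

# Nearest whole-number multiple of the packageRank count.
errors$multiple <- round(errors$cranlogs / errors$packageRank)

# Downloads left over after removing that multiple.
errors$remainder <- errors$cranlogs - errors$multiple * errors$packageRank

errors[, c("date", "multiple", "remainder")]
```

A remainder of zero means the 'cranlogs' count is an exact multiple of the 'packageRank' count; a small positive remainder means it is a multiple plus a handful of extra entries.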

1) I'm not sure what's going on with "2012-10-06".

2) I believe that the problems with "2012-10-07", "2012-10-08" and "2012-10-11" stem from the fact that the logs for those dates are actually duplicated in the RStudio logs.

Nominal          Actual log in file/URL
2012-10-07 ----- 2012-10-07
2012-10-11 ----- 2012-10-07

2012-10-08 ----- 2012-10-08
2012-10-13 ----- 2012-10-08

2012-10-12 ----- 2012-10-11
2012-10-15 ----- 2012-10-11

This overcounting makes sense because, as you wrote in issue #54, you rely on the data in the files and not the filenames/URLs. By doing so, you may have ended up double counting.
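The mechanism can be illustrated with a toy example. This is only a sketch of the suspected double counting, not how 'cranlogs' actually ingests logs: the file contents and the helper `count_on()` are invented, and only the pattern (a file nominally for 2012-10-11 that holds 2012-10-07 entries) follows the table above.

```r
# Toy stand-ins for log files, keyed by their nominal (filename) date.
# The rows are invented; real logs have many more columns and entries.
logs <- list(
  "2012-10-07" = data.frame(date = "2012-10-07", package = c("a", "b"),
                            stringsAsFactors = FALSE),
  "2012-10-11" = data.frame(date = "2012-10-07", package = c("a", "b"),
                            stringsAsFactors = FALSE),
  "2012-10-09" = data.frame(date = "2012-10-09", package = c("a", "c", "d"),
                            stringsAsFactors = FALSE)
)

# Hypothetical helper: number of entries whose in-file date equals `day`.
count_on <- function(rows, day) sum(rows$date == day)

# Pooling every file and counting by the in-file date counts the
# duplicated log twice.
pooled <- do.call(rbind, logs)
naive <- count_on(pooled, "2012-10-07")

# Build a text key per file from its contents and drop files whose
# contents repeat an earlier file.
key <- vapply(logs, function(x) {
  paste(do.call(paste, c(x, sep = "|")), collapse = "\n")
}, character(1L))
deduped <- do.call(rbind, logs[!duplicated(key)])
fixed <- count_on(deduped, "2012-10-07")

c(naive = naive, fixed = fixed)
```

In this toy case the pooled count for 2012-10-07 is twice the deduplicated one, which is the same factor of two seen for the October dates.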

3) I haven't sorted out what's going on with the 4 remaining dates ("2012-12-26", "2012-12-27", "2012-12-28", "2013-01-01") but I'm guessing it has something to do with the fact that they surround the 3 missing/lost RStudio logs ("2012-12-29", "2012-12-30", "2012-12-31").

Note that the ratios for the three December dates are not exactly whole numbers. However, I did a sanity check using the top six packages for each of the three days; they all returned whole-number multiples. If useful, I can provide more details.
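A per-package check of this kind could look roughly like the sketch below. It is not the check that was actually run: `top_ratios()` is a hypothetical helper, and both inputs are invented toy data standing in for one day's raw log and for the per-package counts reported by a download-count service.

```r
# Hypothetical helper: ratio of reported to raw counts for the `n`
# most downloaded packages in one day's raw log.
top_ratios <- function(raw.log, reported, n = 6L) {
  # table() drops rows with a missing package name by default.
  raw.counts <- as.data.frame(table(package = raw.log$package),
    responseName = "raw", stringsAsFactors = FALSE)
  top <- head(raw.counts[order(-raw.counts$raw), ], n)
  out <- merge(top, reported, by = "package")
  out$ratio <- out$count / out$raw
  out
}

# Invented raw log: one row per download, including one unnamed entry.
raw.log <- data.frame(
  package = c(rep("pkgA", 5), rep("pkgB", 3), rep("pkgC", 2), NA),
  stringsAsFactors = FALSE
)

# Invented reported counts, deliberately set to twice the raw counts.
reported <- data.frame(
  package = c("pkgA", "pkgB", "pkgC"),
  count = c(10L, 6L, 4L),
  stringsAsFactors = FALSE
)

check <- top_ratios(raw.log, reported, n = 2L)
check
```

Working per package sidesteps entries without a package name, which is one possible reason a day's total ratio could be slightly off a whole number while the per-package ratios are exact.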

lindbrook commented 2 years ago

For what it's worth, I've patched this in packageRank::cranDownloads() using fixCranlogs(). When any of the 8 days is queried, the function recomputes the counts from a stored copy of the logs for those 8 days (an R list object named "rstudio.logs").