r-hub / cranlogs

Download Logs from the RStudio CRAN Mirror
https://r-hub.github.io/cranlogs/

Option to not count downloads < 1000 bytes #45

Open lindbrook opened 4 years ago

lindbrook commented 4 years ago

To compute the number of package downloads for a given day, cranlogs::cran_downloads() counts the number of entries (number of rows) for that package in CRAN's download logs. Would it be possible to add an optional argument so that observations with sizes less than 1000 bytes do not count toward the number of package downloads?

Two reasons. First, it's hard to say that such observations really represent a package download. Second, while much of this may just be unsuccessful/aborted downloads, I think that some of this is more than random noise.
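
For concreteness, here's a rough sketch of how such an option might look. Both the min_size argument name and the counting are invented for illustration (cranlogs answers queries from a database, not by re-reading the raw logs):

# Hypothetical sketch only: count one day's log entries for a package,
# optionally ignoring entries smaller than 'min_size' bytes.
count_downloads <- function(log, package, min_size = 0) {
  entries <- log[log$package == package & log$size >= min_size, ]
  nrow(entries)
}

# count_downloads(log, "rstan")                   # current behaviour: count every entry
# count_downloads(log, "rstan", min_size = 1000)  # proposed: ignore sub-1000-byte entries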

Using 2019-10-23 as an example, here's what I found.

My back-of-envelope estimate is that around 5% (233,722 / 5,097,912) of all downloads on 2019-10-23 were smaller than 1000 bytes, typically around 500 bytes.
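
For reference, here's a minimal sketch (not cranlogs code) of that back-of-envelope calculation, assuming the documented http://cran-logs.rstudio.com URL pattern and a 'size' column in the daily CSV:

# Read one day of the raw RStudio CRAN mirror log and compute the share of
# entries smaller than 1000 bytes. (Sketch only; the compressed file is large.)
log.url <- "http://cran-logs.rstudio.com/2019/2019-10-23.csv.gz"
tmp <- tempfile(fileext = ".csv.gz")
download.file(log.url, tmp, mode = "wb")
cran.log <- read.csv(tmp, stringsAsFactors = FALSE)  # read.csv reads .gz files transparently
small <- !is.na(cran.log$size) & cran.log$size < 1000
c(small = sum(small), total = nrow(cran.log), share = round(mean(small), 3))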

Here's an example. On 2019-10-23 'rstan' was downloaded 2,574 times.

> cranlogs::cran_downloads("rstan", from = "2019-10-23", to = "2019-10-23")
        date count package
1 2019-10-23  2574   rstan

But if you look at the logs (RStudio's CRAN mirror at http://cran-logs.rstudio.com), you'll see that there are 40 entries smaller than 1000 bytes:

           date     time size package version country ip_id
1438000 2019-10-23 19:49:15 531 rstan 2.11.1 US 7
1438001 2019-10-23 19:49:15 537 rstan 2.12.1 US 7
1438002 2019-10-23 19:49:15 537 rstan 2.13.2 US 7
1438003 2019-10-23 19:49:15 531 rstan 2.14.1 US 7
1438004 2019-10-23 19:49:15 537 rstan 2.14.2 US 7
1438005 2019-10-23 19:49:15 537 rstan 2.15.1 US 7
1438006 2019-10-23 19:49:15 531 rstan 2.16.2 US 7
1438007 2019-10-23 19:49:15 537 rstan 2.17.2 US 7
1438008 2019-10-23 19:49:15 531 rstan 2.17.3 US 7
1438009 2019-10-23 19:49:15 537 rstan 2.17.4 US 7
1438010 2019-10-23 19:49:15 537 rstan 2.18.1 US 7
1438011 2019-10-23 19:49:15 537 rstan 2.18.2 US 7
1438012 2019-10-23 19:49:15 533 rstan 2.7.0-1 US 7
1438013 2019-10-23 19:49:15 533 rstan 2.8.0 US 7
1438014 2019-10-23 19:49:15 533 rstan 2.8.1 US 7
1438015 2019-10-23 19:49:15 539 rstan 2.8.2 US 7
1438016 2019-10-23 19:49:15 539 rstan 2.9.0-3 US 7
1438017 2019-10-23 19:49:15 539 rstan 2.9.0 US 7
1438121 2019-10-23 19:49:14 537 rstan 2.10.1 US 7
3607030 2019-10-23 10:59:51 534 rstan 2.19.2 <NA> 5
3702500 2019-10-23 20:16:37 537 rstan 2.10.1 US 7
3702501 2019-10-23 20:16:38 537 rstan 2.11.1 US 7
3702502 2019-10-23 20:16:38 537 rstan 2.12.1 US 7
3702503 2019-10-23 20:16:38 531 rstan 2.13.2 US 7
3702504 2019-10-23 20:16:38 531 rstan 2.14.1 US 7
3702505 2019-10-23 20:16:38 537 rstan 2.14.2 US 7
3702506 2019-10-23 20:16:38 537 rstan 2.15.1 US 7
3702507 2019-10-23 20:16:38 531 rstan 2.16.2 US 7
3702508 2019-10-23 20:16:38 537 rstan 2.17.2 US 7
3702509 2019-10-23 20:16:38 537 rstan 2.17.3 US 7
3702510 2019-10-23 20:16:38 531 rstan 2.17.4 US 7
3702511 2019-10-23 20:16:38 531 rstan 2.18.1 US 7
3702512 2019-10-23 20:16:38 537 rstan 2.18.2 US 7
3702513 2019-10-23 20:16:38 539 rstan 2.7.0-1 US 7
3702514 2019-10-23 20:16:39 539 rstan 2.8.0 US 7
3702515 2019-10-23 20:16:39 539 rstan 2.8.1 US 7
3702516 2019-10-23 20:16:39 539 rstan 2.8.2 US 7
3702517 2019-10-23 20:16:39 539 rstan 2.9.0-3 US 7
3702518 2019-10-23 20:16:39 533 rstan 2.9.0 US 7
4186380 2019-10-23 18:29:29 530 rstan 2.19.2 US 7


For what it's worth, here's the code for the above log data using packageLog() from the development version of 'packageRank' (v0.3.0.9000) on https://github.com/lindbrook/packageRank

rstan.log <- packageRank::packageLog("rstan", "2019-10-23")
vars <- c("date", "time", "size", "package", "version", "country", "ip_id")
rstan.log[rstan.log$size < 1000, vars]


While 40 of 2,574 downloads is small in percentage terms (1.6%), you'll see that the overwhelming majority of these observations come from a single IP address that is "downloading" different (possibly all) versions of 'rstan'.

While people may, of course, be interested in previous versions of a package and while many people are using network address translation (NAT), this kind of activity is not an isolated event. You'll find it across many packages intermittently throughout the month.

It even extends to "archived" packages (those that are not included on CRAN's main listing). For example, we see that 'bim' was downloaded 12 times:

> cranlogs::cran_downloads("bim", from = "2019-10-23", to = "2019-10-23")
        date count package
1 2019-10-23    12     bim

But all 12 of those downloads look like this:

           date     time size package version country ip_id
1746102 2019-10-23 19:39:06 539 bim 0.92-3 US 7
1746103 2019-10-23 19:39:06 533 bim 0.93-1 US 7
1746105 2019-10-23 19:39:06 537 bim 1.01-1 US 7
1746107 2019-10-23 19:39:06 537 bim 1.01-3 US 7
1746108 2019-10-23 19:39:06 537 bim 1.01-4 US 7
1746109 2019-10-23 19:39:06 537 bim 1.01-5 US 7
1937591 2019-10-23 19:12:29 533 bim 0.92-3 US 7
1937592 2019-10-23 19:12:29 539 bim 0.93-1 US 7
1937593 2019-10-23 19:12:29 537 bim 1.01-1 US 7
1937594 2019-10-23 19:12:29 537 bim 1.01-3 US 7
1937595 2019-10-23 19:12:29 537 bim 1.01-4 US 7
1937596 2019-10-23 19:12:29 537 bim 1.01-5 US 7


bim.log <- packageRank::packageLog("bim", "2019-10-23")
vars <- c("date", "time", "size", "package", "version", "country", "ip_id")
bim.log[, vars]


In this case, it's hard to say that this package was really downloaded.

gaborcsardi commented 4 years ago

It turns out that these are HTTP HEAD requests from an lftp client. I guess it is some feature of lftp. I'll try to update the original script that creates the log files on cran-logs.rstudio.com to filter these out.

We can think about cleaning up the current DB retrospectively. I guess it would make sense, but it will also change the reports generated in the past, etc.

gaborcsardi commented 4 years ago

It seems that I cannot fix this in the original script, so I'll filter them out in my DB update script instead. The smallest CRAN package tarball is currently 1,189 bytes, and I doubt it is practically possible to create a smaller one, so 1000 bytes seems like a good limit. The filter must come somewhere around here: https://github.com/r-hub/cranlogs.app/blob/4ee2355d0739abb4c4d273e4e891a3a01a6165bf/db/update.sh#L30

lindbrook commented 4 years ago

If possible, it might be better to leave the logs as "raw" as possible for those who might be interested, and just update/amend the DB. Since I've put in some time on this, if you want, I'm willing to write up my findings as a note/vignette to explain the discrepancy. Essentially, what's happening is that beyond the random HTTP requests to individual packages, each Wednesday (+ additional days) someone is making requests to all packages on CRAN (all active current versions and all archived inactive past versions).

Let me just check if I need the original DB data before you update it.

gaborcsardi commented 4 years ago

If possible, it might be better to leave the logs as "raw" as possible for those who might be interested

However, without the additional information that these are HEAD requests and not downloads, those log entries are misleading.

I don't think this needs any writeup, especially because the situation may change at any time. I think this issue is a good place to record it.

Let me just check if I need the original DB data before you update it.

It will not happen overnight, I'll probably migrate the service before this, so no worries just yet.

lindbrook commented 4 years ago

I'm fine with filtering out these records. But I am curious. Besides HEAD requests, why else would we see these small "downloads"? Are aborted "downloads" in the log?

For what it's worth, since yesterday was a Wednesday, here's the top of the filtered log for cranlogs:

vars <- c("date", "time", "size", "package", "version", "country", "ip_id")

log.filtered <- packageRank::packageLog("cranlogs", filter = TRUE)
head(log.filtered[, vars])

          date     time  size package version country ip_id
246036 2019-11-13 13:26:43 27060 cranlogs 2.1.1 US 14760
325235 2019-11-13 19:29:32 24417 cranlogs 2.1.1 US 18769
340204 2019-11-13 18:08:12 24417 cranlogs 2.1.1 US 572
872939 2019-11-13 20:31:34 26972 cranlogs 2.1.1 FR 34238
928117 2019-11-13 14:00:44 24287 cranlogs 2.1.1 TR 3298
963961 2019-11-13 09:09:16 27163 cranlogs 2.1.1 JP 42282

And all 7 of the records that are filtered out:

log.audit <- packageRank::packageLog("cranlogs", filter = -1000)
log.audit[, vars]

           date     time size package version country ip_id
714794 2019-11-13 20:22:52 527 cranlogs 2.0.0 US 18559
714795 2019-11-13 20:22:52 528 cranlogs 2.1.0 US 18559
1459096 2019-11-13 19:59:32 527 cranlogs 2.0.0 US 18559
1459097 2019-11-13 19:59:32 528 cranlogs 2.1.0 US 18559
2123148 2019-11-13 08:39:09 0 cranlogs 2.1.1 US 2499
3988585 2019-11-13 07:41:52 529 cranlogs 2.1.1 DE 27828
4143146 2019-11-13 19:22:07 528 cranlogs 2.1.1 US 18559

gaborcsardi commented 4 years ago

But I am curious. Besides HEAD requests, why else would we see these small "downloads"? Are aborted "downloads" in the log?

Every request is in the log, I think, so yes. But aborted downloads are rare. The reasons I can think of:

  1. HEAD request
  2. request with If-Modified-* headers, which results in a 304 response
  3. aborted downloads, but even these probably send more than ~500 bytes, as far as the web server is concerned.

I think 3. is probably very rare. 1. should be filtered out, ideally, but 2. is a proper download (attempt).
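
For illustration, a HEAD request transfers headers only, which is why these entries are so small. A quick sketch using the curl package (the URL is just an arbitrary CRAN file, not a particular package tarball):

# nobody = TRUE sets libcurl's CURLOPT_NOBODY, i.e. the request is sent as HEAD
h <- curl::new_handle(nobody = TRUE)
res <- curl::curl_fetch_memory("https://cran.r-project.org/src/contrib/PACKAGES.gz",
                               handle = h)
res$status_code      # 200: the request succeeded ...
length(res$content)  # ... but no body was transferred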

lindbrook commented 4 years ago

Would you happen to know if the lftp client is part of R and/or RStudio?

gaborcsardi commented 4 years ago

I don't think so.

lindbrook commented 4 years ago

A little more digging...

~500 byte downloads manifest themselves in three ways.

1) as standalone entries:

log <- packageRank::packageLog("cholera", "2020-06-03") log[log$time == "18:48:17", ]

        date     time size r_version r_arch r_os package version country ip_id
24 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.2.1 US 4724
25 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.3.0 US 4724
26 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.4.0 US 4724
27 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.5.0 US 4724
28 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.5.1 US 4724
29 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.6.0 US 4724
30 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.6.5 US 4724

2) as part of a pair (same machine, same time stamp): a ~500 byte entry plus a full download.

log <- packageRank::packageLog("sp", "2020-05-06") log[log$time == "08:14:34", ]

          date     time    size r_version r_arch r_os package version country ip_id
4263 2020-05-06 08:14:34     367 <NA> <NA> <NA> sp 1.4-1 <NA> 6
4264 2020-05-06 08:14:34 1134627 <NA> <NA> <NA> sp 1.4-1 <NA> 6

3) as part of a triplet (same machine, same time stamp). For packages >= 1 MB in size, it seems you'll see a ~500 byte download, a full download, and an intermediate-sized download:

log <- packageLog("cholera", "2020-02-02") log[log$time == "10:19:17", ]

       date     time    size r_version r_arch r_os package version country ip_id
5 2020-02-02 10:19:17 4147305 <NA> <NA> <NA> cholera 0.7.0 US 1047
6 2020-02-02 10:19:17   34821 <NA> <NA> <NA> cholera 0.7.0 US 1047
7 2020-02-02 10:19:17     539 <NA> <NA> <NA> cholera 0.7.0 US 1047

For packages < 1 MB, it seems you'll see a ~500 byte download plus two "full" downloads:

log <- packageLog("cranlogs", "2020-05-06") log[log$time == "09:03:46", ]

        date     time  size r_version r_arch r_os package version country ip_id
19 2020-05-06 09:03:46   529 <NA> <NA> <NA> cranlogs 2.1.1 US 3678
20 2020-05-06 09:03:46 24273 <NA> <NA> <NA> cranlogs 2.1.1 US 3678
21 2020-05-06 09:03:46 24280 <NA> <NA> <NA> cranlogs 2.1.1 US 3678

While we can easily correct for the first two cases by simply filtering out entries smaller than, say, 1000 bytes, the variability in the size of the intermediate download makes triplets harder to deal with. That said, I'm wondering if you'd consider implementing a fix for them.

Here are three reasons why. First, we're essentially counting one download as three. Second, their frequency in the logs seems to be increasing. Third, because it is a somewhat computationally intensive task, it makes more sense to deal with it on the back-end than as part of a function on the user end.

For what it's worth, to get a ballpark empirical sense of the problem, I've written some prototype R code and done some preliminary analysis. If interested, I can post it.
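
In the meantime, here's one possible approach, sketched from the description above rather than taken from that prototype: drop the sub-1000-byte entries first, then collapse entries that share a date, time stamp, ip_id, package, and version into a single download.

# Sketch: remove the ~500-byte entries, then count entries that share
# date, time, ip_id, package, and version only once.
dedupe.downloads <- function(log, min.size = 1000) {
  keep <- log[log$size >= min.size, ]
  key <- with(keep, paste(date, time, ip_id, package, version))
  keep[!duplicated(key), ]
}

# nrow(dedupe.downloads(rstan.log))  # deduplicated count for 'rstan' on 2019-10-23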