ropensci-review-tools / pkgstats

Historical statistics of every R package ever
https://docs.ropensci.org/pkgstats/

Separate date and publication date fields #49

Closed pawelru closed 2 years ago

pawelru commented 2 years ago

Hello,

I am experimenting with pkgstats to analyse the codebase of packages not included in ropensci and I have encountered one issue regarding a date field.

https://github.com/ropensci-review-tools/pkgstats/blob/b4b292a705647d11d6084711130af6d7cac59f20/R/desc-stats.R#L60-L63

In the above you are reading the "Date.Publication" field of the DESCRIPTION file and then storing it in the "date" column. This is somewhat misleading, as I would expect it to read the "Date" field, which could also be present in that file (see this imperfect search). So maybe two separate fields then?

mpadge commented 2 years ago

Thanks @pawelru. The "Date.Publication" field is inserted by CRAN, and provides the reference value for package publication dates, like in their list sorted by date. The problem with "Date" fields is that they are manually entered, whereas "Date.Publication" is automatic and therefore guaranteed to be reliable. Manually-entered date fields are error prone, and prior to more recent automated checking, subject only to manual inspection to ensure that dates were updated beyond previous dates. The provided "Date" fields need not be the actual date of publication by CRAN, and can indeed precede that by arbitrary amounts. (Hard to find examples, as you already know, but one is https://cran.r-project.org/web/packages/nnet/index.html, with a "Date.Published" field one day ahead of the manual "Date" field.)

So "Date.Publication" will always exist for CRAN packages, with the else condition of reverting to mtime values only there to enable the desc_stats() function to work on non-CRAN packages. That said, you mentioned that you "encountered one issue", so feel free to elaborate if that response does not address it.
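The fallback described above can be sketched roughly as follows. This is a minimal illustration and not the actual code in desc-stats.R; the function name `pkg_date` is hypothetical, and it assumes the DESCRIPTION file is read with `read.dcf()`, which keeps the raw field name "Date/Publication".

```r
# Hypothetical helper, NOT the pkgstats implementation: return the CRAN
# publication date when present, otherwise the DESCRIPTION file's mtime.
pkg_date <- function (desc_path) {
    desc <- read.dcf (desc_path)
    if ("Date/Publication" %in% colnames (desc)) {
        # Field inserted by CRAN at publication time
        return (unname (desc [1, "Date/Publication"]))
    }
    # Non-CRAN package: fall back to the file modification time
    format (file.info (desc_path)$mtime, "%Y-%m-%d %H:%M:%S")
}

# Usage on a DESCRIPTION file with only a manual "Date" field:
tmp <- tempfile ()
writeLines (c ("Package: demo", "Version: 0.1.0", "Date: 2022-01-01"), tmp)
pkg_date (tmp) # the mtime of `tmp`; the manual "Date" field is ignored
```

Note that a manual "Date" field plays no part in either branch, which is the behaviour discussed in this issue.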

pawelru commented 2 years ago

Thanks @mpadge for the great explanation - it's much clearer now.

You are definitely right that the "Date" field is inserted manually and is thus subject to human error. Nevertheless, some packages do practice inserting (and maintaining) that field, especially packages that are not yet on CRAN and so lack the "Date.Publication" field. This is exactly the use case I am facing right now - a set of packages hosted on GH only. We are inserting that field manually and we are thinking about a dedicated GHAction in the near future to eliminate the human factor.

My issue in the current set-up is that this is the only place (to the best of my knowledge) where you interfere with the logic of the DESCRIPTION file by storing an X field (whatever it is) as Y, especially when I do have a Y field which is being ignored. That behaviour is not documented (I need to dig into the codebase to understand what's going on) and it could be confusing. I can also imagine the situation where one would look for a "Date.Publication" column and find a "date" instead.

So my user story is: I do have a "Date" field and I am sure that it's correct. How to make it present in the summary of pkgstats?

My proposed solution is to keep all the fields as they are (i.e. both "Date.Publication" and "Date") and not change their labels. In case "Date" is missing, it would be empty. That would address two problems: (i) presence of "Date" (if provided) and (ii) keeping labels unchanged. This would make your (as well as end-users') life a little bit easier, as there is one simple rule of copying, without any exceptions that you have to document and end-users have to get familiar with. What do you think about it?

mpadge commented 2 years ago

@pawelru These are all good thoughts, and in principle I agree with you. The problem is in practice. Yours seems to be an edge case where you are able to claim that you are "sure that [the Date field] is correct." That can never generally be true, and,

We are, in general, anti-Date field because it is automatically added when, say, CRAN builds your package and that Date will always be correct, by definition. So it turns out that manually tending a Date field causes more problems than it solves, because people routinely forget to update it. (from https://github.com/r-lib/usethis/issues/806#issuecomment-509078545)

There are other issues here too. Adding multiple fields to the output of pkgstats_summary() quickly causes bloat. Like this (performed on current CRAN archive data):

x <- readRDS ("pkgstats-results.Rds")
s0 <- object.size (x)
format (s0, units = "Mb")
#> [1] "173.1 Mb"
x$date2 <- x$date
s1 <- object.size (x)
format (s1, units = "Mb")
#> [1] "181.9 Mb"
round (as.numeric (100 * (s1 / s0 - 1)), digits = 2)
#> [1] 5.06

Created on 2022-08-08 by the reprex package (v2.0.1)

And adding only one extra character column bloats the size of an already quite large object by 5%. Adding more fields is not really viable. It could be possible to modify the result of the primary pkgstats() call - before the summary - to add additional fields there, but then the uniqueness of your particular case comes into play. There are other, likely equally plausible cases, like users needing to summarise values from the fields inserted by the remotes package, which include both "Packaged" (as date-only), and "Built" (as machine + date info) fields. The possibilities are endless, as Debian Control Files like R package DESCRIPTION files can be arbitrarily extended. For general usage, the "Date" field should be considered unreliable, and is indeed considered so by the entire r-lib suite of packages, and pkgstats.

My suggestion would be for you to implement an additional routine to extract those data yourself, and manually append them to the pkgstats_summary() results. It can be modified from the following script, which also serves the useful purpose of demonstrating how generally unreliable the "Date" fields are, and conversely how generally reliable the "mtime" values are. Running on a single thread over all of CRAN takes < 30 minutes, so easily re-creatable as a one-off script. Modification for updated local archives is trivial, so updating results would be effectively instantaneous.
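For a set of local, non-CRAN package directories, the "extract and append" step could look something like the sketch below. The helper names `desc_date` and `append_desc_date` and the column name `desc_date` are hypothetical, and `s` stands in for a data frame with one row of pkgstats_summary() results per package directory, in the same order as `pkg_dirs`.

```r
# Hypothetical helper: read the manual "Date" field from a package source
# directory, returning NA when the field is absent.
desc_date <- function (pkg_dir) {
    desc <- read.dcf (file.path (pkg_dir, "DESCRIPTION"), fields = "Date")
    unname (desc [1, "Date"])
}

# Hypothetical helper: append those values as an extra column to `s`,
# assumed to hold one summary row per entry of `pkg_dirs`, in the same order.
append_desc_date <- function (s, pkg_dirs) {
    stopifnot (nrow (s) == length (pkg_dirs))
    s$desc_date <- vapply (pkg_dirs, desc_date, character (1), USE.NAMES = FALSE)
    s
}
```

Keeping the extra column in a separate, user-side step means the pkgstats_summary() output itself stays unchanged for everybody else.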


Date fields in CRAN packages

This serves as a reference for the accuracy of "Date" fields. This first script generates and saves the data.

flist <- list.files ("/<path>/<to>/<cran_mirror>/tarballs", pattern = "\\.tar\\.gz$", full.names = TRUE)
flist <- normalizePath (flist)
d <- pbapply::pblapply (flist, function (f) {
    ftar <- utils::untar(f, exdir = tempdir(), list = TRUE, tar = "internal")
    # pkgs may have embedded DESC files, but main will always be 1st
    desc <- grep ("DESCRIPTION$", ftar, value = TRUE) [1]
    chk <- utils::untar (f, files = desc, exdir = tempdir ())
    desc_path <- file.path (tempdir (), desc)

    x <- tryCatch (
        data.frame (read.dcf (desc_path)), # data.frame standardises names
        error = function (e) NULL
    )
    out <- data.frame (
        "Date.Publication" = "",
        "Date" = "",
        "mtime" = "")
    if (is.null (x)) {
        return (out)
    }
    if ("Date.Publication" %in% names (x)) out$Date.Publication <- x$Date.Publication
    if ("Date" %in% names (x)) out$Date <- x$Date
    # mtime must be read from the extracted file, not the archive-relative path
    out$mtime <- paste0 (file.info (desc_path)$mtime)

    # remove the extracted DESCRIPTION file again
    unlink (desc_path)

    return (out)
})

d <- do.call (rbind, d)
d <- d [which (nzchar (d$mtime)), ] # keep rows (not columns) with an mtime
d$Date.Publication <- gsub ("\\s.*$", "", d$Date.Publication)
d$mtime <- gsub ("\\s.*$", "", d$mtime)

library (lubridate)
dp <- ymd (d$Date.Publication)
d$diff_date <- as.integer (dp - ymd (d$Date))
d$diff_mtime <- as.integer (dp - ymd (d$mtime))
saveRDS (d, "dates.Rds")

Results

These values show the number of days by which "Date/Publication" values on CRAN packages differ from both the mtime values for the tarballs and the stated "Date" values in package DESCRIPTION files. As computed in the script above, each difference is the "Date/Publication" value minus the other date, so positive values indicate "Date/Publication" > (mtime | Date), and negative values indicate "Date/Publication" < (mtime | Date). Negative values should not happen, as any manually-entered dates should have values prior to the "Date/Publication" value inserted by CRAN.

setwd ("/data/mega/code/repos/ropensci-review-tools/pkgstats")
library (tidyr)
library (ggplot2)
d <- readRDS ("dates.Rds")
prop_date <- length (which (!is.na (d$diff_date))) / nrow (d)
message (round (100 * prop_date, digits = 1), "% of packages have a 'Date' field")
#> 48.3% of packages have a 'Date' field

d <- pivot_longer (d, cols = starts_with ("diff_"))
d$name <- gsub ("^diff\\_", "", d$name)
d <- d [which (!is.na (d$value)), c ("name", "value")]
ggplot (d, aes (value, colour = name)) +
    geom_freqpoly () +
    scale_y_continuous (trans = "log10") +
    theme (legend.position = c (0.1, 0.9))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Transformation introduced infinite values in continuous y-axis

Created on 2022-08-08 by the reprex package (v2.0.1)

Date values sometimes exceed the dates at which the packages are published on CRAN by up to 20 years! This analysis can now serve as the definitive reference for pkgstats of the comment that

manually tending a Date field causes more problems than it solves

And for your case @pawelru, the scripts included here hopefully enable you to easily extract the values you require in your specific case, as well as demonstrating that that case does not, unfortunately, translate to the more general case of the entirety of CRAN and beyond.


The next commit will update the documentation, and close this issue, by clarifying what the "Date" field actually holds. Thanks for the opportunity to investigate this important aspect, and I hope that this helps your case.