ropensci-review-tools / pkgstats

Historical statistics of every R package ever
https://docs.ropensci.org/pkgstats/

pkgstats

Extract summary statistics of R package structure and functionality. The package does not extract every conceivable statistic, but aims to balance insightful statistics against computational feasibility. pkgstats is a static code analysis tool, and so is generally very fast (a few seconds at most for very large packages). Installation is described in a separate vignette.

What statistics?

Statistics are derived from these primary sources:

  1. Numbers of lines of code, documentation, and white space (both between and within lines) in each directory and language
  2. Summaries of package DESCRIPTION file and related package meta-statistics
  3. Summaries of all objects created via package code across multiple languages and all directories containing source code (./R, ./src, and ./inst/include).
  4. A function call network derived from function definitions obtained from the code tagging library, ctags, and references (“calls”) to those obtained from another tagging library, gtags. This network roughly connects every object making a call (as from) with every object being called (to).
  5. An additional function call network connecting calls within R functions to all functions from other R packages.
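
The from/to structure of the call network in item 4 can be illustrated with a toy example. This is only a sketch: the function names below are hypothetical, and the real network table contains further columns not shown here.

```r
# Toy call network; the object names are hypothetical, not taken from pkgstats
network <- data.frame (
    from = c ("main_fn", "main_fn", "helper_a", "helper_b"),
    to = c ("helper_a", "helper_b", "helper_b", "utility"),
    stringsAsFactors = FALSE
)

# Number of outgoing calls made by each object
n_out <- table (network$from)
# Number of incoming calls received by each object
n_in <- table (network$to)

print (n_out)
print (n_in)
```

Each row is one edge of the network, so simple tabulations like these give the numbers of calls made by, and made to, each object.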

The primary function, pkgstats(), returns a list of these various components, including full data.frame objects for the final three components described above. The statistical properties of this list can be aggregated by the pkgstats_summary() function, which returns a data.frame with a single row of summary statistics. This function is demonstrated below, including full details of all statistics extracted.

Demonstration

The following code demonstrates the output of the main function, pkgstats, using an internally bundled .tar.gz “tarball” of this package. The system.time call demonstrates that the static code analyses of pkgstats are generally very fast.

library (pkgstats)
tarball <- system.file ("extdata", "pkgstats_9.9.tar.gz", package = "pkgstats")
system.time (
    p <- pkgstats (tarball)
)
##    user  system elapsed 
##   1.701   0.124   1.802
names (p)
## [1] "loc"            "vignettes"      "data_stats"     "desc"          
## [5] "translations"   "objects"        "network"        "external_calls"

The result is a list of various data extracted from the code. All except for objects, network, and external_calls represent summary data:

p [!names (p) %in% c ("objects", "network", "external_calls")]
## $loc
## # A tibble: 3 × 12
## # Groups:   language, dir [3]
##   language dir   nfiles nlines ncode  ndoc nempty nspaces nchars nexpr ntabs
##   <chr>    <chr>  <int>  <int> <int> <int>  <int>   <int>  <int> <dbl> <int>
## 1 C++      src        3    365   277    21     67     933   7002     1     0
## 2 R        R         19   3741  2698   536    507   27575  94022     1     0
## 3 R        tests      7    348   266    10     72     770   6161     1     0
## # … with 1 more variable: indentation <int>
## 
## $vignettes
## vignettes     demos 
##         0         0 
## 
## $data_stats
##           n  total_size median_size 
##           0           0           0 
## 
## $desc
##    package version                date license
## 1 pkgstats     9.9 2022-05-12 11:41:22   GPL-3
##                                                                                      urls
## 1 https://docs.ropensci.org/pkgstats/,\nhttps://github.com/ropensci-review-tools/pkgstats
##                                                       bugs aut ctb fnd rev ths
## 1 https://github.com/ropensci-review-tools/pkgstats/issues   1   0   0   0   0
##   trl depends                                                        imports
## 1   0      NA brio, checkmate, dplyr, fs, igraph, methods, readr, sys, withr
##                                                                         suggests
## 1 hms, knitr, pbapply, pkgbuild, Rcpp, rmarkdown, roxygen2, testthat, visNetwork
##   enhances linking_to
## 1       NA      cpp11
## 
## $translations
## [1] NA

The various components of these results are described in further detail in the main package vignette.
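
As a small illustration of how the loc component might be used, the following sketch rebuilds a few columns of the table shown above as a plain data.frame and derives the proportion of documentation lines in each directory. The values are copied from the output above; the prop_doc column is a name introduced here for illustration only. With real results, p$loc could presumably be used in place of the hand-built table.

```r
# Values copied from the $loc output above (selected columns only)
loc <- data.frame (
    language = c ("C++", "R", "R"),
    dir = c ("src", "R", "tests"),
    nlines = c (365L, 3741L, 348L),
    ncode = c (277L, 2698L, 266L),
    ndoc = c (21L, 536L, 10L),
    nempty = c (67L, 507L, 72L),
    stringsAsFactors = FALSE
)

# In this output, code + documentation + empty lines add up to the total
stopifnot (all (loc$ncode + loc$ndoc + loc$nempty == loc$nlines))

# Proportion of lines in each directory that are documentation
# ("prop_doc" is a hypothetical column name)
loc$prop_doc <- round (loc$ndoc / loc$nlines, 3)
print (loc [, c ("language", "dir", "nlines", "prop_doc")])
```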

Overview of statistics and the pkgstats_summary() function

A summary of the pkgstats data can be obtained by submitting the object returned from pkgstats() to the pkgstats_summary() function:

s <- pkgstats_summary (p)

This function reduces the result of the pkgstats() function to a single line with 95 entries, represented as a data.frame with one row and that number of columns. This format is intended to enable summary statistics from multiple packages to be aggregated by simply binding rows together. While 95 statistics might seem like a lot, the pkgstats_summary() function aims to return as many usable raw statistics as possible in order to flexibly allow higher-level statistics to be derived through combination and aggregation.

These 95 statistics can be roughly grouped into the following categories (not shown in the order in which they actually appear), with variable names in parentheses after each description. Some statistics are summarised as comma-delimited character strings, such as translations into human languages, or other packages listed under “depends”, “imports”, or “suggests”. This enables subsequent analyses of their contents, for example of actual translated languages, or both aggregate numbers and individual details of all package dependencies, as demonstrated below.
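
The row-binding approach to aggregation can be sketched as follows. The two one-row data.frame objects here are hypothetical stand-ins for the results of pkgstats_summary() on two packages; real results have many more columns, and the package names and values are invented for illustration.

```r
# Hypothetical stand-ins for one-row results of pkgstats_summary()
s1 <- data.frame (
    package = "pkgA",
    version = "1.0",
    suggests = "knitr, testthat",
    stringsAsFactors = FALSE
)
s2 <- data.frame (
    package = "pkgB",
    version = "0.2",
    suggests = NA_character_,
    stringsAsFactors = FALSE
)

# Summaries share the same columns, so rows can simply be bound together
all_pkgs <- do.call (rbind, list (s1, s2))
print (all_pkgs)
```

Using do.call with a list means the same line works unchanged for any number of packages.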

Package Summaries

Information from DESCRIPTION file

Numbers of entries in each of the last two kinds of items can be obtained by a simple strsplit call, like this:

deps <- strsplit (s$suggests, ", ") [[1]]
length (deps)
## [1] 9
print (deps)
## [1] "hms"        "knitr"      "pbapply"    "pkgbuild"   "Rcpp"      
## [6] "rmarkdown"  "roxygen2"   "testthat"   "visNetwork"
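
The same idea can be extended to count entries in several dependency fields at once. The following is a minimal sketch: the one-row data.frame reproduces the depends, imports, and suggests values shown in the $desc output above, count_entries is a hypothetical helper, and it is an assumption that an empty field appears as NA in the summary.

```r
# One-row stand-in holding the dependency fields from the $desc output above
s_deps <- data.frame (
    depends = NA_character_,
    imports = "brio, checkmate, dplyr, fs, igraph, methods, readr, sys, withr",
    suggests = paste0 (
        "hms, knitr, pbapply, pkgbuild, Rcpp, ",
        "rmarkdown, roxygen2, testthat, visNetwork"
    ),
    stringsAsFactors = FALSE
)

# Count comma-delimited entries, treating NA as zero (hypothetical helper)
count_entries <- function (x) {
    if (is.na (x)) {
        return (0L)
    }
    length (strsplit (x, ",\\s*") [[1]])
}

# Apply the helper to each column of the one-row data.frame
n_deps <- vapply (s_deps, count_entries, integer (1))
print (n_deps)
```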

Numbers of files and associated data

Statistics on lines of code

Statistics on individual objects (including functions)

These statistics all refer to “functions”, but actually represent more general “objects,” such as global variables or class definitions (generally from languages other than R), as detailed below.

Network Statistics

The full structure of the network table is described below, with summary statistics including:

External Call Statistics

The final column in the result of the pkgstats_summary() function summarises the external_calls object detailing all calls made to external packages (including to base and recommended packages). This summary is also represented as a single character string. For each package, the string gives the total number of function calls, followed by the number of unique functions called. Data for each package are separated by a comma, while data within each package are separated by a colon.

s$external_calls
## [1] "base:447:78,brio:7:1,dplyr:7:4,fs:4:2,graphics:10:2,hms:1:1,igraph:3:3,pbapply:1:1,pkgstats:99:60,readr:8:5,stats:16:2,sys:13:1,tools:2:2,utils:10:7,visNetwork:3:2,withr:5:1"

This structure allows numbers of calls to all packages to be readily extracted with code like the following:

calls <- do.call (
    rbind,
    strsplit (strsplit (s$external_calls, ",") [[1]], ":")
)
calls <- data.frame (
    package = calls [, 1],
    n_total = as.integer (calls [, 2]),
    n_unique = as.integer (calls [, 3])
)
print (calls)
##       package n_total n_unique
## 1        base     447       78
## 2        brio       7        1
## 3       dplyr       7        4
## 4          fs       4        2
## 5    graphics      10        2
## 6         hms       1        1
## 7      igraph       3        3
## 8     pbapply       1        1
## 9    pkgstats      99       60
## 10      readr       8        5
## 11      stats      16        2
## 12        sys      13        1
## 13      tools       2        2
## 14      utils      10        7
## 15 visNetwork       3        2
## 16      withr       5        1

The two numeric columns respectively show the total number of calls made to each package, and the total number of unique functions used within those packages. These results provide detailed information on numbers of calls made to, and functions used from, other R packages, including base and recommended packages.
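
Once in this form, the table can be used to derive further statistics. The following sketch starts from the external_calls string shown above, so that it runs without the package, and ranks packages by their share of all calls. The prop_total column is a name introduced here for illustration.

```r
# The external_calls string shown above
external_calls <- paste0 (
    "base:447:78,brio:7:1,dplyr:7:4,fs:4:2,graphics:10:2,hms:1:1,",
    "igraph:3:3,pbapply:1:1,pkgstats:99:60,readr:8:5,stats:16:2,",
    "sys:13:1,tools:2:2,utils:10:7,visNetwork:3:2,withr:5:1"
)

# Split into packages on commas, then into fields on colons
parts <- do.call (rbind, strsplit (strsplit (external_calls, ",") [[1]], ":"))
calls <- data.frame (
    package = parts [, 1],
    n_total = as.integer (parts [, 2]),
    n_unique = as.integer (parts [, 3]),
    stringsAsFactors = FALSE
)

# Share of all calls going to each package, largest first
# ("prop_total" is a hypothetical column name)
calls$prop_total <- calls$n_total / sum (calls$n_total)
calls <- calls [order (calls$n_total, decreasing = TRUE), ]
print (head (calls, 3))
```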

Finally, the summary statistics conclude with two further statistics, afferent_pkg and efferent_pkg. These are package-internal measures of afferent and efferent couplings between the files of a package. The afferent couplings (ca) are numbers of incoming calls to each file of a package from functions defined elsewhere in the package, while the efferent couplings (ce) are numbers of outgoing calls from each file of a package to functions defined elsewhere in the package. These can be used to derive a measure of “internal package instability” as the ratio of efferent to total coupling (ce / (ce + ca)).
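
The instability ratio can be sketched as a small function. This is only an illustration of the formula ce / (ce + ca): the instability helper is hypothetical, the input values are invented rather than taken from any package, and how afferent_pkg and efferent_pkg are actually stored in the summary is not assumed here.

```r
# Instability as the ratio of efferent to total coupling, ce / (ce + ca).
# Hypothetical helper; returns NA where there are no couplings at all.
instability <- function (ca, ce) {
    total <- ca + ce
    ifelse (total == 0, NA_real_, ce / total)
}

# Illustrative per-file values only, not taken from any package
ca <- c (4, 0, 2, 0)
ce <- c (1, 3, 2, 0)
print (instability (ca, ce))
```

Values near 0 indicate files that are mostly called by others, while values near 1 indicate files that mostly call out to other files.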

There are many other “raw” statistics returned by the main pkgstats() function which are not represented in pkgstats_summary(). The main package vignette provides further detail on the full results.

The following sub-sections provide further detail on the objects, network, and external_calls items, which could be used to extract additional statistics beyond those described here.

Code of Conduct

Please note that this package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.