ropensci-review-tools / pkgstats

Historical statistics of every R package ever
https://docs.ropensci.org/pkgstats/

Database of all object references in all CRAN packages #15

Open mpadge opened 3 years ago

mpadge commented 3 years ago

@krlmlr Our discussions about Sourcegraph got me thinking that the routines in this package could be used to generate a database of all object references in all CRAN packages - function calls in R, but also arbitrarily more complex object references in all other src and inst languages. All of this info is currently extracted in the CRAN archive trawl, but is ultimately discarded in order to summarise all stats for each package as a single vector. The full intermediate results could nevertheless be dumped in a database, the whole thing put in some publicly accessible place, and everyone would have the ability to query object relationships and cross-references within and between all R packages.

I note in particular that the "References" in Sourcegraph seem to be merely text-based, and are not actual object references - the whole system treats code as mere text. With this system we could build a proper Sourcegraph-like system that linked any object (function, class, struct, whatever) to all other references in all CRAN packages. Thoughts?

krlmlr commented 3 years ago

I love the idea of such a database. I think support for R code is most important -- I often want to find uses of methods or functions in other packages.

What would be the size of the database? Should we start with a machine-readable dump into individual files committed to GitHub, and take it from there?

mpadge commented 2 years ago

@krlmlr Interim progress report on this: Can now extract network of all external calls, all done through static analyses. Example via #20 with summary of all external calls from dplyr:

library (pkgstats)
packageVersion ("pkgstats")
#> [1] '0.0.1.6'
u <- "https://cran.r-project.org/src/contrib/dplyr_1.0.7.tar.gz"
path <- file.path (tempdir (), basename (u))
download.file (u, destfile = path)

s <- pkgstats (path)
pkgstats_summary (s)$external_calls
#> [1] "base:654,DBI:3,dplyr:316,generics:22,glue:7,graphics:1,lobstr:3,methods:11,pillar:4,rlang:3,RSQLite:1,stats:5,tidyselect:9,utils:10,vctrs:5"
# Counts of calls to each package, as comma-separated "pkg:count" pairs

# Can be processed to extract further info:
x <- strsplit (pkgstats_summary (s)$external_calls, ",") [[1]]
x <- do.call (rbind, strsplit (x, ":"))
x <- data.frame (pkg = x [, 1],
                 ncalls = as.integer (x [, 2]))
x$ncalls_rel <- round (x$ncalls / sum (x$ncalls), 3)
x <- x [order (x$ncalls, decreasing = TRUE), ]
rownames (x) <- NULL
print (x)
#>           pkg ncalls ncalls_rel
#> 1        base    654      0.620
#> 2       dplyr    316      0.300
#> 3    generics     22      0.021
#> 4     methods     11      0.010
#> 5       utils     10      0.009
#> 6  tidyselect      9      0.009
#> 7        glue      7      0.007
#> 8       stats      5      0.005
#> 9       vctrs      5      0.005
#> 10     pillar      4      0.004
#> 11        DBI      3      0.003
#> 12     lobstr      3      0.003
#> 13      rlang      3      0.003
#> 14   graphics      1      0.001
#> 15    RSQLite      1      0.001

Created on 2021-09-22 by the reprex package (v2.0.0.9000)
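
As a rough sketch of the "machine-readable dump into individual files" idea raised earlier in the thread, the summary string above could be turned into an edge list and written to one CSV file per package. This uses base R only. The helper name `parse_external_calls`, the column names, and the directory layout are all assumptions made for illustration, not part of pkgstats. The input string is copied from the dplyr 1.0.7 output above.

```r
# Hypothetical helper (not part of pkgstats): parse an "external_calls"
# summary string of the form "pkg:n,pkg:n,..." into an edge list with one
# row per called package.
parse_external_calls <- function (calls, from) {
    parts <- strsplit (strsplit (calls, ",") [[1]], ":")
    data.frame (
        from = from,
        to = vapply (parts, function (p) p [1], character (1)),
        ncalls = as.integer (vapply (parts, function (p) p [2], character (1))),
        stringsAsFactors = FALSE
    )
}

# Summary string copied from the dplyr 1.0.7 output above
calls <- paste0 (
    "base:654,DBI:3,dplyr:316,generics:22,glue:7,graphics:1,lobstr:3,",
    "methods:11,pillar:4,rlang:3,RSQLite:1,stats:5,tidyselect:9,",
    "utils:10,vctrs:5"
)

edges <- parse_external_calls (calls, from = "dplyr")

# One CSV file per package version; this directory layout is an assumption
out_dir <- file.path (tempdir (), "external_calls")
dir.create (out_dir, showWarnings = FALSE)
out_file <- file.path (out_dir, "dplyr_1.0.7.csv")
write.csv (edges, out_file, row.names = FALSE)

# Reading all files back gives a single edge list that can be queried,
# here for all recorded calls into "rlang"
files <- list.files (out_dir, full.names = TRUE)
all_edges <- do.call (rbind, lapply (files, read.csv))
all_edges [all_edges$to == "rlang", ]
```

With files for many packages in the same directory, the final subset would list every package recorded as calling into a given package. That covers only package-level cross-references, not the object-level references proposed at the top of this issue.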