russHyde opened this issue 4 years ago
Related to russHyde/dupree#38.
I ran dupree over a set of ten CRAN packages; each takes < 20 secs to analyse on my computer.
TODO:
Doing this will almost certainly find some dupree-breakers: packages that either take too long to analyse or that break the code parsing.
Preferred:
We want the GitHub repo for each CRAN package mentioned in https://github.com/ropensci/PackageDevelopment
We can get all CRAN DESCRIPTION data using the following (updated from Julia Silge's blog: https://juliasilge.com/blog/mining-cran-description/):
library(dplyr)
library(tibble)
cran <- tools::CRAN_package_db()
# the returned data frame has two columns with the same name;
# drop the duplicates by name rather than by position
cran <- cran[, !duplicated(names(cran))]
# make it a tibble
cran <- as_tibble(cran)
cran
# There are ~ 5.5k packages that are hosted on github
sum(grepl("github", cran$URL) | grepl("github", cran$BugReports))
[1] 5641
cran_gh <- filter(cran, grepl("github", URL) | grepl("github", BugReports))
Note: you could get the GitHub URL for each package directly from the ropensci markdown, but some of those packages will have been dropped from CRAN by now.
The packages mentioned in the task view can be obtained from https://github.com/ropensci/PackageDevelopment/blob/master/PackageDevelopment.ctv
This would need to be mined with xml2, for example; each package name is the value of a <pkg>...</pkg> tag.
library(xml2)
xml_path <- "https://raw.githubusercontent.com/ropensci/PackageDevelopment/master/PackageDevelopment.ctv"
xml_data <- xml2::read_xml(xml_path)
# "//pkg" finds <pkg> nodes anywhere in the document
dev_pkgs <- xml_text(xml_find_all(xml_data, "//pkg"))
# 113 packages are still on CRAN
length(intersect(dev_pkgs, cran$Package))
# 82 packages are on CRAN and have a github repo
dev_cran_gh <- filter(cran_gh, Package %in% dev_pkgs)
dim(dev_cran_gh)
# as of today:
dev_cran_gh$Package
[1] "aoos" "aprof" "argparse" "assertr" "available"
[6] "backports" "badgecreatr" "checkmate" "checkpoint" "CodeDepends"
[11] "covr" "cranly" "devtools" "docopt" "drat"
[16] "ensurer" "formatR" "functools" "GetoptLong" "getPass"
[21] "git2r" "gitlabr" "GRANBase" "gWidgets2" "htmlwidgets"
[26] "hunspell" "import" "inline" "js" "knitr"
[31] "later" "lintr" "log4r" "logging" "matlabr"
[36] "microbenchmark" "miniCRAN" "mockr" "optigrab" "packagedocs"
[41] "packrat" "pacman" "pipeR" "pkgconfig" "pkgdown"
[46] "pkggraph" "pkgmaker" "pkgnet" "prof.tree" "profmem"
[51] "profr" "progress" "proto" "purrr" "R.oo"
[56] "R6" "rcmdcheck" "Rcpp" "Rd2roxygen" "RDocumentation"
[61] "Rdpack" "remotes" "reticulate" "rhub" "RInside"
[66] "rJava" "rlang" "roxygen2" "rscala" "RStata"
[71] "rstudioapi" "rtype" "semver" "shiny" "skeletor"
[76] "sys" "testit" "testthat" "unitizer" "V8"
[81] "vdiffr" "withr"
Note that some of the above have multiple entries in their URL / BugReports fields (seems funny that gitlabr is hosted on GitHub...).
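Since the URL / BugReports fields can hold several comma-separated links, a small helper could pick out the first github.com entry for cloning. This is a sketch (the helper name and example URLs are hypothetical, not part of dupree):

```r
# Hypothetical helper: pick the first github.com link out of a package's
# URL and BugReports fields (either may hold several comma-separated values)
first_github_url <- function(url, bug_reports) {
  candidates <- unlist(strsplit(c(url, bug_reports), "[,[:space:]]+"))
  gh <- grep("github\\.com", candidates, value = TRUE)
  if (length(gh) == 0) {
    return(NA_character_)
  }
  # strip a trailing /issues (BugReports links point at the issue tracker)
  sub("/issues/?$", "", gh[[1]])
}

first_github_url(
  "https://github.com/user/pkg, https://user.github.io/pkg/",
  "https://github.com/user/pkg/issues"
)
#> [1] "https://github.com/user/pkg"
```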
TODO:
- [x] Use git2r to clone the repo for every one of the above packages
- [ ] For each package, run dupree_package with several choices of min_block_size
- [ ] Which package has the most easily removed code?
- [ ] Run a few other code-analysis tools for comparison:
  - cloc (keeping only the info re R files) for lines of code
  - gitsum
  - covr
  - cyclocomp
  - pkgnet
see branch analyse-dev-tools
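The clone-and-analyse steps above could be sketched as follows (untested here, since it needs network access; assumes git2r and dupree are installed, and uses russHyde/dupree itself as the example repo — in practice the URL vector would be derived from cran_gh):

```r
library(git2r)
library(dupree)

# named vector of package -> GitHub URL; a single example entry here,
# in practice derived from cran_gh$URL / cran_gh$BugReports
gh_urls <- c(dupree = "https://github.com/russHyde/dupree")

clone_dir <- file.path(tempdir(), "pkg_clones")
dir.create(clone_dir, showWarnings = FALSE)

results <- list()
for (pkg in names(gh_urls)) {
  local_path <- file.path(clone_dir, pkg)
  if (!dir.exists(local_path)) {
    git2r::clone(url = gh_urls[[pkg]], local_path = local_path)
  }
  # run dupree at a few block-size thresholds
  for (mbs in c(20, 40, 100)) {
    results[[paste(pkg, mbs, sep = "_")]] <-
      dupree::dupree_package(local_path, min_block_size = mbs)
  }
}
```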
Perhaps split this out into a separate repo, since it's bigger than a single-script analysis; could use drake.
Moved this subjob to a separate repo: code_as_data
Plan for Newcastle satRdays abstract:
Code-analysis tools: how to combine all of these together, similar to code-maat.
Probably need to work on the visual representation of projects.