russHyde opened this issue 4 years ago
Related to russHyde/dupree#38.
I ran dupree over a set of ten CRAN packages; each takes < 20 secs to analyse on my computer.
TODO:
Doing this will almost certainly find some dupree-breakers: packages that either take too long to analyse or that break the code parsing.
Preferred:
We want the GitHub repo for each CRAN package mentioned in https://github.com/ropensci/PackageDevelopment
We can get all CRAN DESCRIPTION data using the following (updated from Julia Silge's blog: https://juliasilge.com/blog/mining-cran-description/):
library(dplyr)
library(tibble)
cran <- tools::CRAN_package_db()
# the returned data frame has two columns with the same name;
# drop the duplicates by name rather than by position
cran <- cran[, !duplicated(names(cran))]
# make it a tibble
cran <- as_tibble(cran)
cran
# There are ~ 5.5k packages that are hosted on github
sum(grepl("github", cran$URL) | grepl("github", cran$BugReports))
[1] 5641
cran_gh <- filter(cran, grepl("github", URL) | grepl("github", BugReports))
Note: you could get the GitHub URL for each package directly from the ropensci markdown, but some of those packages will have been dropped from CRAN by now.
The packages mentioned in the task view can be obtained from https://github.com/ropensci/PackageDevelopment/blob/master/PackageDevelopment.ctv
This would need to be mined with xml2, for example; each package name is the value of a <pkg>...</pkg> tag.
library(xml2)
xml_path <- "https://raw.githubusercontent.com/ropensci/PackageDevelopment/master/PackageDevelopment.ctv"
xml_data <- xml2::read_xml(xml_path)
# "//pkg" finds <pkg> nodes anywhere in the document
dev_pkgs <- xml_text(xml_find_all(xml_data, "//pkg"))
# 113 packages are still on CRAN
length(intersect(dev_pkgs, cran$Package))
# 82 packages are on CRAN and have a github repo
dev_cran_gh <- filter(cran_gh, Package %in% dev_pkgs)
dim(dev_cran_gh)
# as of today:
dev_cran_gh$Package
[1] "aoos" "aprof" "argparse" "assertr" "available"
[6] "backports" "badgecreatr" "checkmate" "checkpoint" "CodeDepends"
[11] "covr" "cranly" "devtools" "docopt" "drat"
[16] "ensurer" "formatR" "functools" "GetoptLong" "getPass"
[21] "git2r" "gitlabr" "GRANBase" "gWidgets2" "htmlwidgets"
[26] "hunspell" "import" "inline" "js" "knitr"
[31] "later" "lintr" "log4r" "logging" "matlabr"
[36] "microbenchmark" "miniCRAN" "mockr" "optigrab" "packagedocs"
[41] "packrat" "pacman" "pipeR" "pkgconfig" "pkgdown"
[46] "pkggraph" "pkgmaker" "pkgnet" "prof.tree" "profmem"
[51] "profr" "progress" "proto" "purrr" "R.oo"
[56] "R6" "rcmdcheck" "Rcpp" "Rd2roxygen" "RDocumentation"
[61] "Rdpack" "remotes" "reticulate" "rhub" "RInside"
[66] "rJava" "rlang" "roxygen2" "rscala" "RStata"
[71] "rstudioapi" "rtype" "semver" "shiny" "skeletor"
[76] "sys" "testit" "testthat" "unitizer" "V8"
[81] "vdiffr" "withr"
Note that some of the above have multiple entries in their URL / BugReports fields (seems funny that gitlabr is hosted on GitHub...).
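Since the URL / BugReports fields can hold several comma-separated links, a small helper could pick out the first github.com entry for cloning. This is a sketch (the helper name and example URLs are hypothetical, not part of dupree):

```r
# Hypothetical helper: pick the first github.com link out of a package's
# URL and BugReports fields (either may hold several comma-separated values)
first_github_url <- function(url, bug_reports) {
  candidates <- unlist(strsplit(c(url, bug_reports), "[,[:space:]]+"))
  gh <- grep("github\\.com", candidates, value = TRUE)
  if (length(gh) == 0) {
    return(NA_character_)
  }
  # strip a trailing /issues (BugReports links point at the issue tracker)
  sub("/issues/?$", "", gh[[1]])
}

first_github_url(
  "https://github.com/user/pkg, https://user.github.io/pkg/",
  "https://github.com/user/pkg/issues"
)
#> [1] "https://github.com/user/pkg"
```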
TODO:
- [x] Use git2r to clone the repo for every one of the above packages
- [ ] For each package, run dupree_package with several choices of min_block_size
- [ ] Which package has the most easily removed code?
- [ ] Run a few other code-analysis tools for comparison:
  - cloc (keeping only the info re R files) for lines of code
  - gitsum
  - covr
  - cyclocomp
  - pkgnet
see branch analyse-dev-tools
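The clone-and-analyse steps above could be sketched as follows (untested here, since it needs network access; assumes git2r and dupree are installed, and uses russHyde/dupree itself as the example repo — in practice the URL vector would be derived from cran_gh):

```r
library(git2r)
library(dupree)

# named vector of package -> GitHub URL; a single example entry here,
# in practice derived from cran_gh$URL / cran_gh$BugReports
gh_urls <- c(dupree = "https://github.com/russHyde/dupree")

clone_dir <- file.path(tempdir(), "pkg_clones")
dir.create(clone_dir, showWarnings = FALSE)

results <- list()
for (pkg in names(gh_urls)) {
  local_path <- file.path(clone_dir, pkg)
  if (!dir.exists(local_path)) {
    git2r::clone(url = gh_urls[[pkg]], local_path = local_path)
  }
  # run dupree at a few block-size thresholds
  for (mbs in c(20, 40, 100)) {
    results[[paste(pkg, mbs, sep = "_")]] <-
      dupree::dupree_package(local_path, min_block_size = mbs)
  }
}
```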
Perhaps split this out into a separate repo, since it's bigger than a single-script analysis; could use drake.
Moved this subjob to a separate repo: code_as_data
Plan for Newcastle satRdays abstract:
Code-analysis tools: how to combine all of these together, similar to code-maat.
Probably need to work on the visual representation of projects.