Closed pawelru closed 2 months ago
@pawelru First of all please note you are referring throughout to open-source software maintained willingly by nice people trying to do nice things. Asserting in an issue title that a package or function "is inaccurate*, and then claiming as your opening sentence,
I have found the calculation of external calls quite inaccurate
is neither nice nor respectful. I am tempted to close this issue for this reason alone, but fortunately for you am intrigued by what your report, and think it is worth further consideration here. In future, please do try to be more respectful towards voluntary open-source maintainers.
Some points at the outset:
checkglobals
package asserts that it "approximately detect[s] global and imported functions."pkgstats
considers code from any and all possible locations, including /inst
and /vignettes
directories. contrary to what you say above.ctags
has been maintained for years, with over 10,000 commits, and years of very detailed debugging and refining. (I have also extensively used tree-sitter
, which I am confident would generate similar results to ctags
, yet is less suited to the task here.)On to your comparisons:
teal.reporter
The 4 lines tagged by checkglobals
are:
# in teal_reporter.R
TealReportCard <- R6::R6Class( # nolint: object_name.
classname = "TealReportCard",
inherit = teal.reporter::ReportCard,...
# in teal_reporter.R
TealSlicesBlock <- R6::R6Class( # nolint: object_name_linter.
classname = "TealSlicesBlock",
inherit = teal.reporter:::TextBlock, ...
# in init.R
reporter <- teal.reporter::Reporter$new()$set_id(attr(filter, "app_id"))
# in reporter_previewer_module.R
checkmate::assert_true(all(names(server_args) %in% names(formals(teal.reporter::reporter_previewer_srv))))
Of these four, pkgstats
only tags the single instance in init.R
, which is clearly a call to a function. The first two references identified by checkglobals
are only inheritance templates for an R6 class. These are not function calls, and aspects of the ineheritance may or may not even be called upon during code execution. ctags
appropriately does not identify these as function calls. The final instance, in reporter_previewer_module.R
, is also clearly not an actual call to a teal.reporter
function, rather just a line used to extract the formal arguments to that function. Again, ctags
appropriately excludes this, and ultimately from these four only identifies one as a function call. I entirely agree with that.
This case is more interesting. The use of renv
is in teal_lockfile.R
:
renv_logs <- utils::capture.output(
renv::snapshot(
lockfile = lockfile_path,
prompt = FALSE,
force = TRUE
# type = is taken from renv::settings$snapshot.type()
)
)
ctags
in this case entirely fails to detect the renv::snapshot
call, yet picks up all other lines. I'm not sure why this may be, and may either represent a genuine tagging omission from ctags
, or it may be that it appropriately identifies that function as being called there only for the side-effect of capturing output? Even I am unsure whether or not I would consider that a genuine function call.
checkglobals
detects 8 calls to shinyjs
, while pkgstats
only detects two. These two are:
# in module_snapshot_manager.R
shinyjs::click(id = "teal-wunder_bar-show_snapshot_manager", asis = TRUE)
# in module_init_data.R
if (getOption("teal.show_js_log", default = FALSE)) {
shinyjs::showLog()
}
Those are obviously direct function calls, and should be tagged as such, which they are. The additional checkglobals
output not tagged by ctags
includes function calls from module_init_data.R
and include_css_js.R
. The latter file includes two shinyjs
function names in this definition:
include_teal_css_js <- function() {
tagList(
shinyjs::useShinyjs(),
include_css_files(),
# init.js is executed from the server
include_js_files(except = "init.js"),
shinyjs::hidden(icon("fas fa-gear")), # add hidden icon to load font-awesome css for icons
)
}
These are not tagged by ctags
, which I suspect is appropriate here, as they are merely passed as arguments to a tagList()
function. Those arguments then may or may not actually be called, so it is appropriate for a static code analyser to exclude those from identified function calls. The other call in that file is in these lines:
run_js_files <- function(files) {
checkmate::assert_character(files, min.len = 1, any.missing = FALSE)
lapply(files, function(file) {
shinyjs::runjs(paste0(readLines(system.file("js", file, package = "teal", mustWork = TRUE)), collapse = "\n"))
})
invisible(NULL)
}
ctags
does then entirely miss that definition, and only tags the call to lapply
, without parsing any of the content of what is applied to each file
. So in that case I would agree with you that ctags
, and therefore pkgstats
, does miss one external call there. I'll keep this issue open to investigate that further.
More generally, I hope that this has made you aware that identifying function calls is not a trivial exercise. Note in general that my tendency would be to place greater trust in anything that provides lower rather than higher estimates of numbers of function calls. Especially in a dynamically-typed and functional language like R, many references to functions will generally not be function calls, and should not be tagged as such. ctags
here does indeed do a better job of identifying such instances than checkglobals
. I'll report back on the one example above, which should also give insight into other discrepancies.
Thanks @mpadge for a prompt and very detailed response - I really appreciate that. Especially that you actually went beyond into the referenced codebase and explain individual examples. Very much appreciated.
Let me begin by saying that I never intended to offend anyone in my previous message. My aim was to simply share some feedback on things that I found concerning. I believe that the constructive feedback is essential, especially in the context of open-sourcing. By this, I mean reports, not criticism. I am sorry if my words were misunderstood.
I totally agree with you that this is not a trivial problem. My background is that I am looking for a measures of a dependency of a package. My first idea was to measure the usage of the objects and I am currently experimenting with different tools. Your message makes me realise that this (ctags
) is meant for something different - i.e. counting external function calls which in some cases makes a big difference - e.g. whether or not to include foo::bar
. On that note, I think that ctags
results are correct. As you explained in details. I was just looking for something different and unintentionally misunderstood the results.
No worries @pawelru, and thank you for your considered response here. Good to hear that you generally understand what I was trying to set out in the detailed response above. The occasional missing function calls from ctags
are nevertheless something I've been grappling with for a long time here, so i'll keep this issue open to dig a bit further here.
@pawelru I'm going to close this now, but will continue to address the issue via #4, where I've already pinged you. Thanks for finally giving me the incentive to pursue that!
@pawelru I've implemented tree-sitter parsing of external calls in #62, which currently gives these results for your package:
path <- "/<local>/<path>/<to>/teal"
has_tabs <- FALSE
pkg_name <- basename (path)
# The current pkgstats approach with 'ctags':
tags_r <- withr::with_dir (path, get_ctags ("R", has_tabs))
ex <- pkgstats:::external_call_network (tags_r, path, pkg_name)
ex <- ex [which (!ex$package == "base"), ]
nrow (ex)
#> [1] 495
# New tree-sitter parsing from code in #62:
tags_treesitter <- get_treesitter_tags (path)
ex <- pkgstats:::external_call_network (tags_treesitter, path, pkg_name)
ex <- ex [which (!ex$package == "base"), ]
nrow (ex)
#> [1] 395
ex <- data.frame (checkglobals::check_pkg (path))
ex <- ex [which (!ex$package == "base"), ]
nrow (ex)
#> [1] 159
Created on 2024-09-13 with reprex v2.1.1
That suggests current approach here is actually okay. The tree-sitter values may be presumed to provide a "true" reference. I'll investigate further over in #62
Thank you for follow ups on this and apologies for a delayed answer. I was on a short break and right afterwards had to focus on something different. Now I am back into this and it looking very very promising. Thanks again. Even though the totals are somewhat similar I actually have found some differences. Please find below some top picks. For each of these, I have found treesitter to be closer to the truth.
Please find below a few examples that I was able to check quickly. I'm using the updated local copy of a referenced repo so the overall numbers might be slightly different to yours but the difference is rather minimal (e.g. I have 511 and you got 495). I'm yet to check it against checkglobals. This was my starting point - to check the top picks
r$> ex_r |> group_by(package, call) |> summarise(N = n()) |> arrange(-N)
`summarise()` has grouped output by 'package'. You can override using the `.groups` argument.
# A tibble: 269 × 3
# Groups: package [22]
package call N
<chr> <chr> <int>
1 base list 69
2 base c 67
3 utils data 56
4 teal modules 51
5 base lapply 36
6 base names 27
7 base args 25
8 base sprintf 21
9 base class 20
10 base if 20
# ℹ 259 more rows
# ℹ Use `print(n = ...)` to see more rows
r$> ex_t |> group_by(package, call) |> summarise(N = n()) |> arrange(-N)
`summarise()` has grouped output by 'package'. You can override using the `.groups` argument.
# A tibble: 343 × 3
# Groups: package [26]
package call N
<chr> <chr> <int>
1 base c 103
2 base sprintf 73
3 base length 63
4 base list 62
5 logger log_debug 57
6 base names 55
7 base inherits 48
8 base lapply 47
9 base is.null 41
10 base attr 36
# ℹ 333 more rows
# ℹ Use `print(n = ...)` to see more rows
interestingly, base::list()
got a very similar results - 69 vs 62. I skipped the analysis of this.
foo <- function(data, ...) {...}
or bar(data = data, ...)
- ctags recognises this as a utils::data
data
arguments
r$> ex_r |> filter(package == "utils" & call == "data") |> nrow()
[1] 56
r$> ex_t |> filter(package == "utils" & call == "data") |> nrow()
[1] 6
modules
arguments. This one is a even more tricky because there is an exported modules()
function.
r$> ex_r |> filter(package == "teal" & call == "modules") |> nrow()
[1] 51
r$> ex_t |> filter(package == "teal" & call == "modules") |> nrow()
[1] 0
(now I'm guessing the pattern) if(length(x) > 0)
- ctags does not recognise base::length
r$> ex_r |> filter(package == "base" & call == "length") |> nrow()
[1] 7
r$> ex_t |> filter(package == "base" & call == "length") |> nrow()
[1] 63
here I'm out of ides why:
r$> ex_r |> filter(package == "logger" & call == "log_debug") |> nrow()
[1] 0
r$> ex_t |> filter(package == "logger" & call == "log_debug") |> nrow()
[1] 57
https://github.com/search?q=repo%3Ainsightsengineering%2Fteal%20logger%3A%3Alog_debug&type=code
lapply(filter, function(x) {...})
- ctags recognises this as a stats::filter
r$> ex_r |> filter(package == "stats" & call == "filter") |> nrow()
[1] 16
r$> ex_t |> filter(package == "stats" & call == "filter") |> nrow()
[1] 0
These are just top diffs that I checked. Even though the sum is comparable - there are actually some differences - both positive and negative. Now I don't want you to go into the details - that's not the point of my message. What I just wanted to say here is that based on that brief analysis, the treesitter approach looks very very promising!
Thanks @pawelru for pursuing this further. I've implemented a full treesitter function tagger in currently experiment package, pkgsimil
(in even more efficient form). You just need to install that package, and run:
pkgsimil::pkgsimil_tag_fns(path)
Just note that things are likely to change quickly in that package, including function names and potentially name of package. But the underlying algorithm works really well, and identifies every single function call.
That will eventually find it's way over here, but doing so will require rebuilding our entire CRAN database to update function call data. So that will have to wait a while ...
I have found the calculation of external calls quite inaccurate. Please find below a few examples.
Reading the source code I am aware that it's probably an imperfection of external dependent
ctags
. It might do well other things (like accounting internal vars) but for counting dependent packages I have found it not detecting occurrences which results in undercounting totals or not including some at all.It's difficult to ask you to fix it and I'm just sharing my findings with you. Maybe you can consider alternative ways of getting what's needed? I'm aware of relatively new
treesitter
but I personally have no experience in it. Maybecheckglobals
show here?Checking a randomly selected
teal
repo that I'm working on:pkgstats
gives me this:(where
x
is defined as per the vignette)checkglobals
(I'm not an author) reports me this:I'm aware of some differences, e.g. you include
R/
dir only whilecheckglobals
look into all dirs (incl. tests, vignettes etc.) but nevertheless the latter seems to be more accurate to me. A few examples using imperfect regexp on my local clone just to show you why I think that the values are incorrect:teal.reporter
dependencycheckglobals
reports 4 uniquepkgstats
reports 1 uniqueR/
pkgstats
reports just one unique occurencerenv
checkglobals
reports 1 uniquepkgstats
reports 0R/
R//teal_lockfile.R: renv::snapshot( R//teal_lockfile.R: # type = is taken from renv::settings$snapshot.type()
❯ grep -r "^[^#].*shinyjs::" R/
R//utils.R: shinyjs::runjs(script) # function does not return anything R//include_css_js.R: shinyjs::runjs(paste0(readLines(system.file("js", file, package = "teal", mustWork = TRUE)), collapse = "\n")) R//include_css_js.R: shinyjs::useShinyjs(), R//include_css_js.R: shinyjs::hidden(icon("fas fa-gear")), # add hidden icon to load font-awesome css for icons R//module_snapshot_manager.R: shinyjs::click(id = "teal-wunder_bar-show_snapshot_manager", asis = TRUE) R//module_init_data.R: shinyjs::showLog() R//module_init_data.R: shinyjs::disable(selector = sprintf(".teal-body:has('#%s') .nav li a", session$ns("content"))) R//module_init_data.R: shinyjs::enable(selector = sprintf(".teal-body:has('#%s') .nav li a", session$ns("content"))) R//module_init_data.R: shinyjs::hide( R//module_init_data.R: shinyjs::runjs(