r-lib / lintr

Static Code Analysis for R
https://lintr.r-lib.org
Other
1.19k stars 183 forks source link

Revisit linters included in the list of efficiency linters? #2653

Closed IndrajeetPatil closed 3 weeks ago

IndrajeetPatil commented 3 weeks ago

I am surprised by some linters included under efficiency tag.

library(lintr)
names(linters_with_tags("efficiency"))
#>  [1] "any_duplicated_linter"            "any_is_na_linter"                
#>  [3] "boolean_arithmetic_linter"        "consecutive_mutate_linter"       
#>  [5] "fixed_regex_linter"               "if_switch_linter"                
#>  [7] "ifelse_censor_linter"             "inner_combine_linter"            
#>  [9] "length_test_linter"               "lengths_linter"                  
#> [11] "list_comparison_linter"           "literal_coercion_linter"         
#> [13] "matrix_apply_linter"              "nested_ifelse_linter"            
#> [15] "nrow_subset_linter"               "nzchar_linter"                   
#> [17] "outer_negation_linter"            "redundant_equals_linter"         
#> [19] "redundant_ifelse_linter"          "regex_subset_linter"             
#> [21] "routine_registration_linter"      "sample_int_linter"               
#> [23] "scalar_in_linter"                 "seq_linter"                      
#> [25] "sort_linter"                      "string_boundary_linter"          
#> [27] "undesirable_function_linter"      "undesirable_operator_linter"     
#> [29] "unnecessary_concatenation_linter" "unnecessary_lambda_linter"       
#> [31] "vector_logic_linter"              "which_grepl_linter"

Created on 2024-09-05 with reprex v2.1.1


IMO, the following linters shouldn't be in this category:

x <- list(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))

bench::mark(
  lapply(x, sum) > 10,
  unlist(lapply(x, sum)) > 10,
  iterations = 1000L
)
#> # A tibble: 2 × 6
#>   expression                       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 lapply(x, sum) > 10           1.31µs   1.48µs   631356.        0B        0
#> 2 unlist(lapply(x, sum)) > 10   1.68µs   1.93µs   437505.        0B        0

Created on 2024-09-05 with reprex v2.1.1


WDYT? Maybe you also a few others that in hindsight stick out.

MichaelChirico commented 3 weeks ago

Hmm, I am surprised about list_comparison_linter(). FWIW, I would not use unlist() as the comparison -- the idea is to encourage vapply() instead of implicit coercion with >. That said, lapply() is surprisingly competitive despite the implicit coercion:

x = replicate(10000, sample(10, 3), simplify=FALSE)
microbenchmark(times = 1000L, lapply(x, sum) > 10, sapply(x, sum, USE.NAMES=FALSE) > 10, vapply(x, sum, 1) > 10, unlist(lapply(x, sum)) > 10)
# Unit: milliseconds
#                                    expr      min       lq     mean   median       uq       max neval cld
#                     lapply(x, sum) > 10 3.426255 3.597885 4.952596 3.686916 3.910366  24.69174  1000   a
#  sapply(x, sum, USE.NAMES = FALSE) > 10 3.657206 3.822736 5.259227 3.927971 4.182691 160.00862  1000   a
#                  vapply(x, sum, 1) > 10 3.489346 3.643905 5.064258 3.748941 3.971946 157.50112  1000   a
#             unlist(lapply(x, sum)) > 10 3.470735 3.630335 5.345501 3.720350 3.942800 153.28562  1000   a
MichaelChirico commented 3 weeks ago

I agree undesirable_function_linter()/undesirable_operator_linter() should be dropped.

And sure, let's drop list_comparison_linter() -- it's clearly not a big efficiency penalty to use it.

IndrajeetPatil commented 3 weeks ago

seq_linter() also doesn't seem like a good fit here 🤔

library(bench)

x <- data.frame(a = rnorm(1e6))

bench::mark(
  1:nrow(x), 
  seq_len(nrow(x))
)[1:5]
#> # A tibble: 2 × 5
#>   expression            min   median `itr/sec` mem_alloc
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 1:nrow(x)          1.11µs   1.23µs   666808.        0B
#> 2 seq_len(nrow(x))   1.11µs   1.23µs   698924.        0B

Created on 2024-09-06 with reprex v2.1.1

MichaelChirico commented 2 weeks ago

That's not the only case in mind for seq_linter():

microbenchmark(times=100, seq(1, 1e7), seq_len(1e7))
# Unit: nanoseconds
#            expr  min   lq     mean median     uq    max neval cld
#   seq(1, 1e+07) 3640 3706 10314.14 3771.0 3901.0 633721   100   a
#  seq_len(1e+07)  121  150   862.12  161.5  220.5  67561   100   a

(I guess we don't cover that yet, but we should...)

IndrajeetPatil commented 2 weeks ago

Indeed, we don't cover that case.

library(lintr)
lint(text = "seq(1L, 1e7L)", linters = all_linters())
#> ℹ No lints found.

Created on 2024-09-07 with reprex v2.1.1


How should we proceed here? Do we still leave the seq_linter() under efficiency tag, considering this future inclusion?

In that case, I can close my PR.