I am not sure I will be able to grasp the internals during R Dev Day, but I would be happy to run numerical experiments to measure the complexity of findInterval(), then propose a documentation patch based on what I find. Please let me know if this is of interest.
As of a few minutes ago, R-devel contains a version of findInterval() with two new options, checkSorted = TRUE and checkNA = TRUE. If you set one or both of these to FALSE, you can indeed speed up computations (but you get wrong results if you are "lying"). checkNA = FALSE may actually still work on some platforms; it did for me on (Fedora 40) Linux with the default gcc compiler and glibc libraries.
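A minimal usage sketch, assuming an R-devel build in which these arguments are available; the inputs here are constructed so that the promises actually hold:

# Assumes an R-devel findInterval() with the checkSorted and checkNA arguments.
vec <- seq(0, 1, length.out = 1e6)   # sorted, NA-free breakpoints
x   <- runif(1e7)

idx_checked <- findInterval(x, vec)  # default: validity checks are performed
idx_fast    <- findInterval(x, vec, checkSorted = FALSE, checkNA = FALSE)

# Disabling the checks must not change the answer when the inputs are clean.
stopifnot(identical(idx_checked, idx_fast))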
For R Dev Day, I learned the basics of findInterval()'s algorithm with the help of @lawremi and Robert Gentleman, and I ran benchmarks to examine complexity. My results are for R 4.4.0. Here is the simulation study:
metrics <- NULL
n_data <- 1e8L
data <- runif(n_data)
data_increasing <- sort(data, decreasing = FALSE)
data_decreasing <- sort(data, decreasing = TRUE)
data_range <- range(data)
# Fix length(x) at 1e8 and vary length(vec) over 10, 100, ..., 1e5 breaks.
for (n_breaks in 10 ^ seq_len(5)) {
  for (rep in seq_len(10)) {
    breaks <- seq(from = min(data_range) - 1, to = max(data_range) + 1, length.out = n_breaks)
    seconds_increasing <- system.time(findInterval(x = data_increasing, vec = breaks))["sys.self"]
    seconds_decreasing <- system.time(findInterval(x = data_decreasing, vec = breaks))["sys.self"]
    seconds_unsorted <- system.time(findInterval(x = data, vec = breaks))["sys.self"]
    new_metrics <- data.frame(
      n_breaks = n_breaks,
      seconds = c(seconds_increasing, seconds_decreasing, seconds_unsorted),
      scenario = c("increasing", "decreasing", "unsorted"),
      rep = rep
    )
    print(new_metrics)
    metrics <- rbind(metrics, new_metrics)
  }
}
Complexity is mostly as expected: if we fix the length of x and vary the length of vec, we see O(log(N)) complexity when x is unsorted and O(1) complexity when x is sorted in increasing order.
library(dplyr)
library(ggplot2)
# Drop the first repetition as a warm-up, then plot timing versus log(number of breaks).
ggplot(
  data = filter(metrics, rep > 1L),
  mapping = aes(x = log(n_breaks), y = seconds, color = scenario)
) +
  scale_y_log10() +
  geom_point(position = position_dodge(width = 0.5)) +
  theme_gray(20)
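To build intuition for the O(1) behaviour on sorted input, here is a deliberately slow pure-R toy of the search-with-memory idea (an illustration only, not the actual C implementation): it remembers the interval found for the previous element and starts the next search there, so consecutive queries from a sorted x rarely move far.

# Toy sketch: cache the previous answer as the starting guess for the next search.
find_interval_toy <- function(x, vec) {
  out <- integer(length(x))
  guess <- 1L
  for (i in seq_along(x)) {
    j <- guess
    # Walk left or right from the previous answer until vec[j] <= x[i] < vec[j + 1].
    while (j > 1L && x[i] < vec[j]) j <- j - 1L
    while (j < length(vec) && x[i] >= vec[j + 1L]) j <- j + 1L
    out[i] <- if (x[i] < vec[1L]) 0L else j
    guess <- j
  }
  out
}

set.seed(1)
x_toy <- sort(runif(20))
vec_toy <- seq(0, 1, by = 0.1)
stopifnot(identical(find_interval_toy(x_toy, vec_toy), findInterval(x_toy, vec_toy)))

With an unsorted x, the starting guess is unrelated to the next query, so each lookup pays the full search cost, which matches the O(log(N)) trend above.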
There is a slight performance penalty when the initial sort is in decreasing order. Robert hypothesized that there are a few points where the interval search tries a large number of bins before it gets to the correct answer, but then the new choice of bin is the correct initial guess for the next set of points.
ggplot(
  data = filter(metrics, n_breaks == max(n_breaks)),
  mapping = aes(x = scenario, y = seconds, group = rep)
) +
  scale_y_log10() +
  geom_point() +
  geom_line() +
  theme_gray(20)
findInterval() times in general seem to be faster than the default sort method. Radix sort, of course, may be faster still.
seconds_sort_increasing <- replicate(
  10,
  system.time(sort(data, decreasing = FALSE))["sys.self"]
)
hist(seconds_sort_increasing)

seconds_sort_decreasing <- replicate(
  10,
  system.time(sort(data, decreasing = TRUE))["sys.self"]
)
hist(seconds_sort_decreasing)
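To put a number behind the radix remark, one quick follow-up (using the same data vector as above; sort() accepts method = "radix" for atomic vectors of this length):

# Compare the default sort against an explicit radix sort on the same data.
seconds_sort_radix <- replicate(
  10,
  system.time(sort(data, decreasing = FALSE, method = "radix"))["sys.self"]
)
summary(seconds_sort_radix)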
I posted a comment at https://bugs.r-project.org/show_bug.cgi?id=16567 linking to https://github.com/r-devel/r-dev-day/issues/44#issuecomment-2292425011.
It looks like there is no further action to take on this one.
Bug 16567: Claims about computational complexity of findInterval
Initial submission below, see bugzilla for further discussion.
Participants should confirm whether MM's 2015 comment, "The plan is for R to change here and 'automagically' become fast for checks such as is.unsorted() or is.na(.) --- by memoizing such properties in the corresponding object.", ever happened.
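One possible starting point for that check, sketched under the assumption that any such memoization would show up as ALTREP wrapper metadata on sorted results; the inspect() output is undocumented and version-dependent:

# Rough check: does a sorted result carry sortedness / no-NA metadata that
# later calls such as is.unsorted() could reuse? (Flag names are a guess.)
x <- sort(runif(1e6))
.Internal(inspect(x))        # look for wrapper flags along the lines of srt / no_na
system.time(is.unsorted(x))  # if the metadata is used, this should be essentially free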