r-devel / r-dev-day

Repo to organize tasks for R Dev Days

Bug 16567: Claims about computational complexity of findInterval #44

Closed shannonpileggi closed 1 week ago

shannonpileggi commented 3 months ago

Bug 16567: Claims about computational complexity of findInterval

Initial submission below; see Bugzilla for further discussion.

Participants should confirm whether MM's 2015 comment, "The plan is for R to change here and 'automagically' become fast for checks such as is.unsorted() or is.na(.) --- by memoizing such properties in the corresponding object," was ever implemented.
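As a starting point, one could probe whether sortedness is now memoized anywhere, e.g. via ALTREP metadata (a minimal sketch; timings are machine-dependent):

# ALTREP compact sequences carry sortedness metadata, so is.unsorted()
# can answer without scanning; a materialized copy forces a full O(n) scan.
x <- 1:1e8                   # ALTREP compact integer sequence
system.time(is.unsorted(x))  # near-instant if the metadata is consulted
y <- x + 0L                  # arithmetic materializes an ordinary vector
system.time(is.unsorted(y))  # linear scan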

The documentation of findInterval states:

"...  the internal algorithm uses interval
     search ensuring O(n * log(N)) complexity where ‘n <- length(x)’
     (and ‘N <- length(vec)’).  For (almost) sorted ‘x’, it will be
     even faster, basically O(n)."

In fact, the algorithm used to do the interval search is O(N + n * log(N)), due to an initial check that vec is sorted and NA-free. This was noted some time ago:

https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html

Note that even if one reads "internal" as referring to the internal call down to compiled code, the sentence is still incorrect, because both the anyNA() and is.unsorted() checks also result in internal calls.
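A small timing sketch makes the O(N) term visible (timings are illustrative only):

# With a single query point (n = 1), growth in runtime with N can only
# come from the O(N) sortedness/NA checks on vec, not the O(log(N)) search.
vec_small <- sort(runif(1e5))
vec_large <- sort(runif(1e8))
system.time(findInterval(0.5, vec_small))
system.time(findInterval(0.5, vec_large))  # markedly slower despite n = 1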

------------------------------------

The ideal resolution to this (IMO) would be an optional parameter to skip the checks. There are many HPC use cases where sorted vectors are maintained and fast binary search is desired.
wlandau commented 2 months ago

I am not sure I will be able to grasp the internals during R Dev Day, but I would be happy to run numerical experiments to measure the complexity of findInterval(), then propose a documentation patch based on what I find. Please let me know if this is of interest.

mmaechler commented 2 months ago

As of a few minutes ago, R-devel contains a version of findInterval() with two new options, checkSorted = TRUE and checkNA = TRUE. If you set one or both of these to FALSE, you can indeed speed up computations (but you will get wrong results if you are "lying"). checkNA = FALSE may actually work anyway on some platforms; it did for me on (Fedora 40) Linux with the default gcc compiler and glibc libraries.
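A usage sketch against that R-devel signature (the checkSorted/checkNA argument names are as described above and are not available in released versions of R):

vec <- sort(runif(1e7))  # caller guarantees: sorted, no NAs
x <- runif(1e6)
system.time(findInterval(x, vec))  # default: runs the O(N) checks on vec
system.time(findInterval(x, vec, checkSorted = FALSE, checkNA = FALSE))
# skipping the checks is only safe when the guarantees actually hold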

wlandau commented 2 months ago

For R Dev Day, I learned the basics of findInterval()'s algorithm with the help of @lawremi and Robert Gentleman, and I ran benchmarks to examine complexity. My results are for R 4.4.0. Here is the simulation study:

library(dplyr)    # for filter() in the plots below
library(ggplot2)

metrics <- NULL
n_data <- 1e8L
data <- runif(n_data)
data_increasing <- sort(data, decreasing = FALSE)
data_decreasing <- sort(data, decreasing = TRUE)
data_range <- range(data)  # renamed to avoid shadowing base::range()
for (n_breaks in 10 ^ seq_len(5)) {
  breaks <- seq(from = min(data_range) - 1, to = max(data_range) + 1, length.out = n_breaks)
  for (rep in seq_len(10)) {
    seconds_increasing <- system.time(findInterval(x = data_increasing, vec = breaks))["sys.self"]
    seconds_decreasing <- system.time(findInterval(x = data_decreasing, vec = breaks))["sys.self"]
    seconds_unsorted <- system.time(findInterval(x = data, vec = breaks))["sys.self"]
    new_metrics <- data.frame(
      n_breaks = n_breaks,
      seconds = c(seconds_increasing, seconds_decreasing, seconds_unsorted),
      scenario = c("increasing", "decreasing", "unsorted"),
      rep = rep  # kept numeric so filter(rep > 1L) below drops the warm-up run
    )
    print(new_metrics)
    metrics <- rbind(metrics, new_metrics)
  }
}

Complexity is mostly as expected: if we fix the length of x and vary the length of vec, we see O(log(N)) growth when x is unsorted and roughly constant time when x is sorted in increasing order.

ggplot(
  data = filter(metrics, rep > 1L),
  mapping = aes(x = log(n_breaks), y = seconds, color = scenario)
) +
  scale_y_log10() +
  geom_point(position = position_dodge(width = 0.5)) +
  theme_gray(20)

[Figure: seconds (log scale) vs. log(n_breaks), colored by scenario]

There is a slight performance penalty when x is sorted in decreasing order. Robert hypothesized that a few points force the interval search to step through a large number of bins before reaching the correct one, after which that bin is the correct initial guess for the next run of points.
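To make that hypothesis concrete, here is a toy R sketch of the cached-guess idea (an illustration only, not the actual C implementation in src/appl/interv.c, which expands the search window geometrically and then bisects rather than walking one bin at a time):

# Each query starts from the interval found for the previous query, so
# sorted input means the cached guess is usually already (nearly) correct.
find_intervals_cached <- function(x, vec) {
  out <- integer(length(x))
  guess <- 1L
  for (i in seq_along(x)) {
    while (guess < length(vec) && x[i] >= vec[guess + 1L]) guess <- guess + 1L
    while (guess > 1L && x[i] < vec[guess]) guess <- guess - 1L
    out[i] <- if (x[i] < vec[1L]) 0L else guess
  }
  out
}
x_test <- runif(100, -1, 2)
vec_test <- sort(runif(9))
stopifnot(identical(find_intervals_cached(x_test, vec_test),
                    findInterval(x_test, vec_test)))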

ggplot(
  data = filter(metrics, n_breaks == max(n_breaks)),
  mapping = aes(x = scenario, y = seconds, group = rep)
) +
  scale_y_log10() +
  geom_point() +
  geom_line() +
  theme_gray(20)

[Figure: seconds per scenario at the largest n_breaks, one line per rep]

findInterval() times in general seem to be faster than the default sort method; radix sort, of course, may be faster still (see the sketch after the histograms below).

seconds_sort_increasing <- replicate(
  10,
  system.time(sort(data, decreasing = FALSE))["sys.self"]
)
hist(seconds_sort_increasing)

[Figure: histogram of increasing-order sort timings]

seconds_sort_decreasing <- replicate(
  10,
  system.time(sort(data, decreasing = TRUE))["sys.self"]
)
hist(seconds_sort_decreasing)

[Figure: histogram of decreasing-order sort timings]
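For completeness, a sketch timing the radix method explicitly (method = "radix" is a documented option of base::sort(); timings are machine-dependent):

seconds_sort_radix <- replicate(
  10,
  system.time(sort(data, method = "radix"))["sys.self"]
)
hist(seconds_sort_radix)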

wlandau commented 2 months ago

I posted a comment at https://bugs.r-project.org/show_bug.cgi?id=16567 linking to https://github.com/r-devel/r-dev-day/issues/44#issuecomment-2292425011.

hturner commented 1 week ago

Looks like no more action to take on this one.