Improve performance of generating distinct interactions

halhen commented 2 years ago

When generating distinct intersections on data with hundreds of thousands of elements, it grinds to a halt. The time seems to be roughly O(n^2), meaning that with double the data execition takes 2^2=4x times as long. With the help of profviz, we find the main source to be a Filter in pushCombination(), which causes a twice nested loop over the elements.

Minimal benchmark on a fairly beefy computer (5950X, 128 GB RAM) on Fedora Linux, R 4.1.3 and upsetjs 1.11.0, git hash 4b375a8e0

generate_data <- function(n) {
  tibble::tibble(
    col_0 = sample(c(0, 1), n, replace = TRUE),
    col_1 = sample(c(0, 1), n, replace = TRUE),
    col_2 = sample(c(0, 1), n, replace = TRUE),
    col_3 = sample(c(0, 1), n, replace = TRUE),
    col_4 = sample(c(0, 1), n, replace = TRUE),
    col_5 = sample(c(0, 1), n, replace = TRUE),
    col_6 = sample(c(0, 1), n, replace = TRUE),
    col_7 = sample(c(0, 1), n, replace = TRUE),
    col_8 = sample(c(0, 1), n, replace = TRUE),
    col_9 = sample(c(0, 1), n, replace = TRUE)
  )
}

Before this PR:

> start <- Sys.time()
> upsetjs() |>
+     upsetjs:::fromDataFrame(generate_data(10000)) |>
+     upsetjs:::generateDistinctIntersections(limit = 5)
> Sys.time() - start
Time difference of 24.85004 secs

With this PR:

> start <- Sys.time()
> upsetjs() |>
+     upsetjs:::fromDataFrame(generate_data(10000)) |>
+     upsetjs:::generateDistinctIntersections(limit = 5)
> Sys.time() - start
Time difference of 0.7690187 secs

Also, scaling is now closer to O(n) or slightly better. With 10x the data:

> start <- Sys.time()
> upsetjs() |>
+     upsetjs:::fromDataFrame(generate_data(100000)) |>
+     upsetjs:::generateDistinctIntersections(limit = 5)
> Sys.time() - start
Time difference of 5.745839 secs

sgratzl commented 2 years ago

thank you.

btw.

fromDataFrame already extract some combinations from the sets. However, by default they are not the distinct ones.
you can customize it using the c_type parameters, e.g

upsetjs() |>
    upsetjs:::fromDataFrame(generate_data(10000), c_type = "distinctIntersections")

should be enough and way faster since it also uses an optimized version of this combination (data frame + distinct)

sgratzl commented 2 years ago

halhen commented 2 years ago

Thanks for the fromDataFrame(c_type) tip -- that solved my immediate performance needs! :tulip:

upsetjs / upsetjs_r

Improve performance of generating distinct interactions #28