Closed: Adafede closed this issue 3 years ago.
How many columns / sets does your dataset have? Can you give me an example R script that I can use as a starting point?
Hi again,
Sorry for the late reply, I tried to give you all the needed info here:
## example for S Gratzl

# loading libraries
library(tidyverse)
library(UpSetR)
library(upsetjs)

# loading example files
toyset_1 <- read_delim(
  file = gzfile("~/Downloads/toyset_1.tsv.gz"),
  delim = "\t",
  escape_double = FALSE,
  trim_ws = TRUE
) %>%
  data.frame()

toyset_2 <- read_delim(
  file = gzfile("~/Downloads/toyset_2.tsv.gz"),
  delim = "\t",
  escape_double = FALSE,
  trim_ws = TRUE
) %>%
  data.frame()

# counts of attribute1 values, used for the UpSetR query coloring below
count <- toyset_1 %>%
  group_by(attribute1) %>%
  count() %>%
  arrange(attribute1)
# upsetR version of toyset_1
## basic
start <- Sys.time()
upset(
  data = toyset_1,
  sets = c(
    "toy1",
    "toy2",
    "toy3",
    "toy4",
    "toy5",
    "toy6"
  ),
  order.by = "freq",
  set_size.show = TRUE,
  set_size.scale_max = 20000
)
end <- Sys.time()
cat("Plotted in", format(end - start), "\n")
## advanced (would be really nice to have such coloring options)
start <- Sys.time()
upset(
  data = toyset_1,
  sets = c(
    "toy1",
    "toy2",
    "toy3",
    "toy4",
    "toy5",
    "toy6"
  ),
  query.legend = "top",
  queries = list(
    list(
      query = elements,
      params = list(
        "attribute1",
        c(
          count[1, 1],
          count[2, 1],
          count[3, 1]
        )
      ),
      active = TRUE,
      color = "#b2df8a",
      query.name = "kin"
    ),
    list(
      query = elements,
      params = list(
        "attribute1",
        c(
          count[3, 1],
          count[2, 1]
        )
      ),
      active = TRUE,
      color = "#1f78b4",
      query.name = "ord"
    ),
    list(
      query = elements,
      params = list(
        "attribute1",
        c(count[3, 1])
      ),
      active = TRUE,
      color = "#a6cee3",
      query.name = "spe"
    )
  ),
  order.by = "freq",
  set_size.show = TRUE,
  set_size.scale_max = 20000
)
end <- Sys.time()
cat("Plotted in", format(end - start), "\n")
## bigger matrix: 209'301 x 33 (still not that big imho)
start <- Sys.time()
upset(
  toyset_2,
  order.by = "freq",
  set_size.show = TRUE,
  set_size.scale_max = 250000
)
end <- Sys.time()
cat("Plotted in", format(end - start), "\n")
# upsetjs version of toyset_1
## works nicely
start <- Sys.time()
upsetjs() %>%
  fromDataFrame(toyset_1[, 1:6]) %>%
  interactiveChart()
end <- Sys.time()
cat("Plotted in", format(end - start), "\n")

# upsetjs version of toyset_2
## lasts for ages, no idea why... never had the patience to wait until the end
start <- Sys.time()
upsetjs() %>%
  fromDataFrame(toyset_2) %>%
  interactiveChart()
end <- Sys.time()
cat("Plotted in", format(end - start), "\n")
# Thanks a lot
Attachments: toyset_1.tsv.gz, toyset_2.tsv.gz
If something remains unclear, just let me know!
Thank you very much
One of the reasons is that UpSet.js doesn't automatically limit the number of (visible) sets. Thus, in the second case, UpSet.js tries to compute and render all 33 sets and all their possible combinations,
whereas UpSetR seems to limit itself to the top 5 sets by default.
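To put a rough number on the difference: with n sets, the space of possible non-empty combinations is bounded by 2^n - 1, so going from UpSetR's default of the top 5 sets to all 33 is an enormous jump. A quick back-of-the-envelope check in R (illustrative only, not part of the scripts above):

# illustrative only: upper bound on the number of non-empty set combinations
sapply(c(5, 6, 33), function(n) 2^n - 1)
#> [1]         31         63 8589934591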
One possible way is to compute the combinations and sets yourself and then use the expression-input option (https://upset.js.org/integrations/r/articles/basic.html#expression-input) to pass them to UpSet.js, roughly along these lines:
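(A minimal sketch of the idea, shown on toyset_1 whose set columns are known; the helper names set_cols, member, combo, tab and expr are illustrative, and the counts computed here are distinct/exclusive memberships, so interpret them accordingly. The same approach would apply to the larger toyset_2 with its own set columns.)

# sketch: pre-compute the per-combination counts in plain R and hand only
# those to UpSet.js via the expression input, instead of letting it derive
# every combination from the raw rows
library(magrittr)  # for %>%
library(upsetjs)

set_cols <- c("toy1", "toy2", "toy3", "toy4", "toy5", "toy6")  # the sets of interest
member   <- toyset_1[, set_cols] == 1                          # logical membership matrix

# one "a&b&c"-style label per row, matching the expression-input syntax
combo <- apply(member, 1, function(r) paste(set_cols[r], collapse = "&"))
tab   <- table(combo[combo != ""])

# named list of counts, e.g. list(toy1 = 120, `toy1&toy3` = 42, ...)
expr <- as.list(setNames(as.integer(tab), names(tab)))

upsetjs() %>%
  fromExpression(expr) %>%
  interactiveChart()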
Re coloring: https://upset.js.org/integrations/r/articles/basic.html#queries goes in this direction.
Thanks a lot for your answers!
Regarding the top 5: it is the default, but you can quickly plot all 33 without problems:
> start <- Sys.time()
> upset(
+ toyset_2,
+ order.by = "freq",
+ set_size.show = TRUE,
+ set_size.scale_max = 250000,
+ nsets = 33
+ )
> end <- Sys.time()
> cat("Plotted in", format(end - start), "\n")
Plotted in 2.818649 secs
Regarding coloring, I was able to obtain what I wanted thanks to your advice, but I am wondering whether it would be possible to place the legend elsewhere, or to increase the export padding, since the legend gets cut off when exporting ;(
see here
Re scalability: I'm happy to include any PR that will improve the scalability; see https://github.com/upsetjs/upsetjs_r/blob/8653ab790b0ff3e32c3f3de1fe2a56fae20ab8e3/r_package/R/data.R#L55-L111 for how all combinations are currently computed. I'm not an R expert, so it is quite a procedural approach.
maybe have a look at:
https://jokergoo.github.io/ComplexHeatmap-reference/book/upset-plot.html (make_comb_mat; a rough sketch of how it could be used follows below)
or:
https://github.com/hms-dbmi/UpSetR/tree/master/R (more precisely: https://github.com/hms-dbmi/UpSetR/blob/master/R/upset.R)
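A rough sketch of how ComplexHeatmap's combination-matrix helpers could do that pre-computation (just an idea for a starting point, assuming 0/1 membership columns as in the toy sets; not a drop-in replacement for the current code):

# sketch: let make_comb_mat pre-compute the distinct combinations instead of
# enumerating them manually (helper names as per the linked reference)
library(ComplexHeatmap)

m <- make_comb_mat(toyset_1[, 1:6], mode = "distinct")  # 0/1 membership columns

set_name(m)   # set (column) names
set_size(m)   # per-set cardinalities
comb_size(m)  # cardinality of every non-empty distinct combination
comb_name(m)  # binary codes such as "101000" identifying each combination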
tested with latest v1.9.0:
> toyset_1 <- read_delim(
+ file = gzfile("./r_package/tests/testthat/data/toyset_1.tsv.gz"),
+ delim = "\t",
+ escape_double = FALSE,
+ trim_ws = TRUE
+ ) %>%
+ data.frame()
>
> toyset_2 <- read_delim(
+ file = gzfile("./r_package/tests/testthat/data/toyset_2.tsv.gz"),
+ delim = "\t",
+ escape_double = FALSE,
+ trim_ws = TRUE
+ ) %>%
+ data.frame()
> # for upsetR aesthetics
> count <- toyset_1 %>%
+ group_by(attribute1) %>%
+ count() %>%
+ arrange(attribute1)
> ## basic
> start <- Sys.time()
> upset(
+ data = toyset_1,
+ sets = c(
+ "toy1",
+ "toy2",
+ "toy3",
+ "toy4",
+ "toy5",
+ "toy6"
+ ),
+ order.by = "freq",
+ set_size.show = TRUE,
+ set_size.scale_max = 20000,
+ )
> end <- Sys.time()
> cat("Plotted in", format(end - start), "\n")
Plotted in 1.149752 secs
> ## works nicely
> start <- Sys.time()
> upsetjs() %>%
+ fromDataFrame(toyset_1[,1:6], c_type="distinctIntersection") %>%
+ interactiveChart()
> end <- Sys.time()
> cat("Plotted in", format(end - start), "\n")
Plotted in 0.749186 secs
> start <- Sys.time()
> upset(
+ toyset_2,
+ order.by = "freq",
+ set_size.show = TRUE,
+ set_size.scale_max = 250000,
+ nsets = 33
+ )
> end <- Sys.time()
> cat("Plotted in", format(end - start), "\n")
Plotted in 3.841624 secs
> start <- Sys.time()
>
> upsetjs() %>%
+ fromDataFrame(toyset_2, c_type="distinctIntersection", store.elems=FALSE, limit = 40) %>%
+ interactiveChart()
>
> end <- Sys.time()
> cat("Plotted in", format(end - start), "\n")
Plotted in 3.92589 secs
toyset_1: UpSetR plotted in 1.149752 secs, upsetjs plotted in 0.749186 secs
toyset_2: UpSetR plotted in 3.841624 secs, upsetjs plotted in 3.92589 secs
I'm having the following question... Is it possible that your code is mishandling "large" data?
Hi, very nice package! (also your other projects 👍🏼 )
I was wondering whether your code has some issues, since I am almost unable to process a data frame of more than 500 lines. To be more precise, I usually use UpSetR, and there I have no problems: I get an UpSet plot of 500'000 lines in less than 2 sec. With your package and the same data, it runs for ages...
I tried sampling 100 rows of my data: works. 200... works... 300... works... 400... already slow... 500 is the limit then, I have no more patience.
Since your provided example is only 21 rows long I was just wondering... or could it be because the number of intersections grows too rapidly?
(btw I of course do not consider 500 rows "big data" ;P)
Screenshots / Sketches
None at the moment, can do some if you want
Additional context
Thanks a lot!