Closed: Adafede closed this issue 3 years ago.
How many columns / sets does your dataset have? Can you give me an example R script that I can use as a starting point?
Hi again,
Sorry for the late reply, I tried to give you all the needed info here:
## example for S Gratzl

# loading libraries
library(tidyverse)
library(UpSetR)
library(upsetjs)

# loading example files
toyset_1 <- read_delim(
  file = gzfile("~/Downloads/toyset_1.tsv.gz"),
  delim = "\t",
  escape_double = FALSE,
  trim_ws = TRUE
) %>%
  data.frame()

toyset_2 <- read_delim(
  file = gzfile("~/Downloads/toyset_2.tsv.gz"),
  delim = "\t",
  escape_double = FALSE,
  trim_ws = TRUE
) %>%
  data.frame()

# counts of attribute1 values, used for the UpSetR query coloring below
count <- toyset_1 %>%
  group_by(attribute1) %>%
  count() %>%
  arrange(attribute1)
# upsetR version of toyset_1
## basic
start <- Sys.time()
upset(
  data = toyset_1,
  sets = c(
    "toy1",
    "toy2",
    "toy3",
    "toy4",
    "toy5",
    "toy6"
  ),
  order.by = "freq",
  set_size.show = TRUE,
  set_size.scale_max = 20000
)
end <- Sys.time()
cat("Plotted in", format(end - start), "\n")
## advanced (would be really nice to have such coloring options)
start <- Sys.time()
upset(
  data = toyset_1,
  sets = c(
    "toy1",
    "toy2",
    "toy3",
    "toy4",
    "toy5",
    "toy6"
  ),
  query.legend = "top",
  queries = list(
    list(
      query = elements,
      params = list(
        "attribute1",
        c(
          count[1, 1],
          count[2, 1],
          count[3, 1]
        )
      ),
      active = TRUE,
      color = "#b2df8a",
      query.name = "kin"
    ),
    list(
      query = elements,
      params = list(
        "attribute1",
        c(
          count[3, 1],
          count[2, 1]
        )
      ),
      active = TRUE,
      color = "#1f78b4",
      query.name = "ord"
    ),
    list(
      query = elements,
      params = list(
        "attribute1",
        c(count[3, 1])
      ),
      active = TRUE,
      color = "#a6cee3",
      query.name = "spe"
    )
  ),
  order.by = "freq",
  set_size.show = TRUE,
  set_size.scale_max = 20000
)
end <- Sys.time()
cat("Plotted in", format(end - start), "\n")
## bigger matrix: 209'301 x 33 (still not that big imho)
start <- Sys.time()
upset(
  toyset_2,
  order.by = "freq",
  set_size.show = TRUE,
  set_size.scale_max = 250000
)
end <- Sys.time()
cat("Plotted in", format(end - start), "\n")
# upsetjs version of toyset_1
## works nicely
start <- Sys.time()
upsetjs() %>%
  fromDataFrame(toyset_1[, 1:6]) %>%
  interactiveChart()
end <- Sys.time()
cat("Plotted in", format(end - start), "\n")

# upsetjs version of toyset_2
## lasts for ages, no idea why... never had the patience to wait until the end
start <- Sys.time()
upsetjs() %>%
  fromDataFrame(toyset_2) %>%
  interactiveChart()
end <- Sys.time()
cat("Plotted in", format(end - start), "\n")
# Thanks a lot
Attachments: toyset_1.tsv.gz, toyset_2.tsv.gz
If something remains unclear, just let me know!
Thank you very much
One of the reasons is that UpSet.js doesn't automatically limit the number of (visible) sets. Thus, in the second case, UpSet.js tries to compute and render all 33 sets and all their possible combinations,
whereas UpSetR seems to limit itself to the top 5 sets by default.
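To put a rough number on the difference: with n sets, the space of possible non-empty combinations is bounded by 2^n - 1, so going from UpSetR's default of the top 5 sets to all 33 is an enormous jump. A quick back-of-the-envelope check in R (illustrative only, not part of the scripts above):

# illustrative only: upper bound on the number of non-empty set combinations
sapply(c(5, 6, 33), function(n) 2^n - 1)
#> [1]         31         63 8589934591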
One possible way is to compute the combinations and sets yourself and then use the expression-input option (https://upset.js.org/integrations/r/articles/basic.html#expression-input) to pass them to UpSet.js, roughly along these lines:
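(A minimal sketch of the idea, shown on toyset_1 whose set columns are known; the helper names set_cols, member, combo, tab and expr are illustrative, and the counts computed here are distinct/exclusive memberships, so interpret them accordingly. The same approach would apply to the larger toyset_2 with its own set columns.)

# sketch: pre-compute the per-combination counts in plain R and hand only
# those to UpSet.js via the expression input, instead of letting it derive
# every combination from the raw rows
library(magrittr)  # for %>%
library(upsetjs)

set_cols <- c("toy1", "toy2", "toy3", "toy4", "toy5", "toy6")  # the sets of interest
member   <- toyset_1[, set_cols] == 1                          # logical membership matrix

# one "a&b&c"-style label per row, matching the expression-input syntax
combo <- apply(member, 1, function(r) paste(set_cols[r], collapse = "&"))
tab   <- table(combo[combo != ""])

# named list of counts, e.g. list(toy1 = 120, `toy1&toy3` = 42, ...)
expr <- as.list(setNames(as.integer(tab), names(tab)))

upsetjs() %>%
  fromExpression(expr) %>%
  interactiveChart()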
Re coloring: https://upset.js.org/integrations/r/articles/basic.html#queries goes in this direction.
Thanks a lot for your answers!
Regarding the top 5: it is the default, but you can quickly plot all 33 without problems:
> start <- Sys.time()
> upset(
+ toyset_2,
+ order.by = "freq",
+ set_size.show = TRUE,
+ set_size.scale_max = 250000,
+ nsets = 33
+ )
> end <- Sys.time()
> cat("Plotted in", format(end - start), "\n")
Plotted in 2.818649 secs
Regarding coloring, I was able to obtain what I wanted thanks to your advice, but I am wondering whether it would be possible to place the legend elsewhere, or to increase the export padding, since the legend gets cut off when exporting ;(
see here
Re scalability: I'm happy to include any PR that will improve the scalability; see https://github.com/upsetjs/upsetjs_r/blob/8653ab790b0ff3e32c3f3de1fe2a56fae20ab8e3/r_package/R/data.R#L55-L111 for how all combinations are currently computed. I'm not an R expert, so it is quite a procedural approach.
maybe have a look at:
https://jokergoo.github.io/ComplexHeatmap-reference/book/upset-plot.html (make_comb_mat; a rough sketch of how it could be used follows below)
or:
https://github.com/hms-dbmi/UpSetR/tree/master/R (more precisely: https://github.com/hms-dbmi/UpSetR/blob/master/R/upset.R)
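A rough sketch of how ComplexHeatmap's combination-matrix helpers could do that pre-computation (just an idea for a starting point, assuming 0/1 membership columns as in the toy sets; not a drop-in replacement for the current code):

# sketch: let make_comb_mat pre-compute the distinct combinations instead of
# enumerating them manually (helper names as per the linked reference)
library(ComplexHeatmap)

m <- make_comb_mat(toyset_1[, 1:6], mode = "distinct")  # 0/1 membership columns

set_name(m)   # set (column) names
set_size(m)   # per-set cardinalities
comb_size(m)  # cardinality of every non-empty distinct combination
comb_name(m)  # binary codes such as "101000" identifying each combination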
tested with latest v1.9.0:
> toyset_1 <- read_delim(
+ file = gzfile("./r_package/tests/testthat/data/toyset_1.tsv.gz"),
+ delim = "\t",
+ escape_double = FALSE,
+ trim_ws = TRUE
+ ) %>%
+ data.frame()
>
> toyset_2 <- read_delim(
+ file = gzfile("./r_package/tests/testthat/data/toyset_2.tsv.gz"),
+ delim = "\t",
+ escape_double = FALSE,
+ trim_ws = TRUE
+ ) %>%
+ data.frame()
> # for upsetR aesthetics
> count <- toyset_1 %>%
+ group_by(attribute1) %>%
+ count() %>%
+ arrange(attribute1)
> ## basic
> start <- Sys.time()
> upset(
+ data = toyset_1,
+ sets = c(
+ "toy1",
+ "toy2",
+ "toy3",
+ "toy4",
+ "toy5",
+ "toy6"
+ ),
+ order.by = "freq",
+ set_size.show = TRUE,
+ set_size.scale_max = 20000,
+ )
> end <- Sys.time()
> cat("Plotted in", format(end - start), "\n")
Plotted in 1.149752 secs
> ## works nicely
> start <- Sys.time()
> upsetjs() %>%
+ fromDataFrame(toyset_1[,1:6], c_type="distinctIntersection") %>%
+ interactiveChart()
> end <- Sys.time()
> cat("Plotted in", format(end - start), "\n")
Plotted in 0.749186 secs
> start <- Sys.time()
> upset(
+ toyset_2,
+ order.by = "freq",
+ set_size.show = TRUE,
+ set_size.scale_max = 250000,
+ nsets = 33
+ )
> end <- Sys.time()
> cat("Plotted in", format(end - start), "\n")
Plotted in 3.841624 secs
> start <- Sys.time()
>
> upsetjs() %>%
+ fromDataFrame(toyset_2, c_type="distinctIntersection", store.elems=FALSE, limit = 40) %>%
+ interactiveChart()
>
> end <- Sys.time()
> cat("Plotted in", format(end - start), "\n")
Plotted in 3.92589 secs
toyset_1: UpSetR plotted in 1.149752 secs, upsetjs plotted in 0.749186 secs
toyset_2: UpSetR plotted in 3.841624 secs, upsetjs plotted in 3.92589 secs
I'm having the following question... Is it possible that your code is mishandling "large" data?
Hi, very nice package! (also your other projects 👍🏼 )
I was wondering whether your code has some issues, since I am almost unable to process a data frame of more than 500 lines. To be more precise, I usually use UpSetR, and there I have no problems: I get an UpSet plot of 500'000 lines in less than 2 sec. With your package and the same data, it runs for ages...
I tried sampling 100 rows of my data: works. 200... works... 300... works... 400... already slow... 500 is the limit then, I have no more patience.
Since your provided example is only 21 rows long I was just wondering... or could it be because the number of intersections grows too rapidly?
(btw I of course do not consider 500 rows "big data" ;P)
Screenshots / Sketches
None at the moment, can do some if you want
Additional context
Thanks a lot!