krassowski / complex-upset

A library for creating complex UpSet plots with ggplot2 geoms
MIT License

General performance in big(ish) data #137

Open 16mc1r opened 3 years ago

16mc1r commented 3 years ago

Is your feature request related to a problem? Please describe. Using UpSetR I could get plots from big(ish) data with tables of roughly 5 million x 600. ComplexUpset yields no result (within the time I was willing to wait). Yes, the dimensions are silly, but I cannot change the data structure or the complexity I get.

Describe the solution you'd like Without deep knowledge of how the interaction sets are computed, I would suggest a solution based on data.table or matrix permutation. Possibly also a way to provide pre-computed interaction matrices or interaction sets.
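As a rough illustration of what "pre-computed interaction sets" could mean here, the sketch below counts how many rows fall into each observed combination of sets. It uses base R only; the toy data and the names `membership`, `set_cols` and `precomputed` are assumptions made for the example, and this is not how ComplexUpset computes intersections internally.

```R
# Toy stand-in for the large membership table: one row per element,
# one logical column per set (column names are hypothetical).
set.seed(1)
n <- 1000
membership <- data.frame(
  Set1 = runif(n) < 0.5,
  Set2 = runif(n) < 0.3,
  Set3 = runif(n) < 0.1
)
set_cols <- c("Set1", "Set2", "Set3")

# Encode each row's combination as a single string key such as "101",
# then count how often each key occurs. Only observed combinations are kept.
key <- do.call(paste0, unname(lapply(membership[set_cols], as.integer)))
counts <- as.data.frame(table(key), stringsAsFactors = FALSE)
names(counts) <- c("key", "n")

# Decode the keys back into one logical column per set.
bits <- do.call(rbind, strsplit(counts$key, ""))
precomputed <- data.frame(bits == "1", n = counts$n)
names(precomputed) <- c(set_cols, "n")

# Largest combinations first.
precomputed <- precomputed[order(-precomputed$n), ]

# With data.table, the same counts would come from:
#   data.table::as.data.table(membership)[, .N, by = set_cols]
```

The result has at most one row per observed combination, so its size depends on the number of distinct combinations rather than on the number of input rows. For a table with 600 set columns the string keys themselves become large, so a grouped count (as in the data.table comment) is likely the more practical route.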

Describe alternatives you've considered Keep using UpSetR, maybe making my own version which lets me deliver pre-computed matrices to plot.

Context (required) ComplexUpset version: 1.3.1

R version details

```R
$platform
[1] "x86_64-w64-mingw32"

$arch
[1] "x86_64"

$os
[1] "mingw32"

$system
[1] "x86_64, mingw32"

$status
[1] ""

$major
[1] "4"

$minor
[1] "0.5"

$year
[1] "2021"

$month
[1] "03"

$day
[1] "31"

$`svn rev`
[1] "80133"

$language
[1] "R"

$version.string
[1] "R version 4.0.5 (2021-03-31)"

$nickname
[1] "Shake and Throw"
```

R session information

```R
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] odbc_1.3.2        openxlsx_4.2.4    DBI_1.1.1
 [4] tidylog_1.0.2     skimr_2.1.3       arrow_5.0.0
 [7] tarchetypes_0.2.1 targets_0.6.0     data.table_1.14.0
[10] aokaux_0.6.1      here_1.0.1        glue_1.4.2
[13] cyphr_1.1.2       keyring_1.2.0     ggsci_2.9
[16] naniar_0.6.1      cowplot_1.1.1     Hmisc_4.5-0
[19] Formula_1.2-4     survival_3.2-10   lattice_0.20-41
[22] janitor_2.1.0     kableExtra_1.3.4  knitr_1.33
[25] datapasta_3.1.0   forcats_0.5.1     stringr_1.4.0
[28] dplyr_1.0.7       purrr_0.3.4       readr_1.4.0
[31] tidyr_1.1.3       tibble_3.1.2      ggplot2_3.3.5
[34] tidyverse_1.3.1   pacman_0.5.1

loaded via a namespace (and not attached):
 [1] colorspace_2.0-2    ellipsis_0.3.2      visdat_0.5.3
 [4] rprojroot_2.0.2     snakecase_0.11.0    htmlTable_2.2.1
 [7] base64enc_0.1-3     fs_1.5.0            rstudioapi_0.13
[10] farver_2.1.0        bit64_4.0.5         fansi_0.4.2
[13] lubridate_1.7.10    xml2_1.3.2          codetools_0.2-18
[16] splines_4.0.5       jsonlite_1.7.2      broom_0.7.8
[19] cluster_2.1.1       dbplyr_2.1.1        png_0.1-7
[22] compiler_4.0.5      httr_1.4.2          tictoc_1.0.1
[25] backports_1.2.1     assertthat_0.2.1    Matrix_1.3-2
[28] cli_3.0.1           htmltools_0.5.1.1   tools_4.0.5
[31] igraph_1.2.6        gtable_0.3.0        Rcpp_1.0.7
[34] cellranger_1.1.0    vctrs_0.3.8         svglite_2.0.0
[37] xfun_0.22           ps_1.6.0            rvest_1.0.0
[40] lifecycle_1.0.0     scales_1.1.1        clisymbols_1.2.0
[43] hms_1.1.0           parallel_4.0.5      sodium_1.1
[46] RColorBrewer_1.1-2  yaml_2.2.1          gridExtra_2.3
[49] UpSetR_1.4.0        ggplot2movies_0.0.1 rpart_4.1-15
[52] latticeExtra_0.6-29 stringi_1.6.2       checkmate_2.0.0
[55] zip_2.1.1           repr_1.1.3          rlang_0.4.11
[58] pkgconfig_2.0.3     systemfonts_1.0.1   evaluate_0.14
[61] patchwork_1.1.1     labeling_0.4.2      htmlwidgets_1.5.3
[64] bit_4.0.4           tidyselect_1.1.1    processx_3.5.1
[67] plyr_1.8.6          magrittr_2.0.1      R6_2.5.0
[70] generics_0.1.0      pillar_1.6.1        haven_2.4.1
[73] foreign_0.8-81      withr_2.4.2         nnet_7.3-15
[76] modelr_0.1.8        crayon_1.4.1        utf8_1.2.1
[79] rmarkdown_2.9       jpeg_0.1-8.1        grid_4.0.5
[82] readxl_1.3.1        blob_1.2.1          callr_3.7.0
[85] reprex_2.0.0        digest_0.6.27       webshot_0.5.2
[88] ComplexUpset_1.3.1  munsell_0.5.0       viridisLite_0.4.0
```

krassowski commented 3 years ago

Would you like to only plot the bars, or would you want to add more components/annotations? What is the time you are willing to wait? See some recent discussion here: https://github.com/krassowski/complex-upset/issues/133#issuecomment-895155554

16mc1r commented 3 years ago

Just the bars would be fine, it's just a way to identify relevant subsets. Time: difficult to say, 2 min max? This is for "interactive" exploration. RAM is usually not a problem; the standard is 128GB, but up to 500GB are available if necessary.

tomleung1996 commented 3 years ago

I would also like to see a performance improvement for bar charts (upset plots). I can wait hours or even days, provided that the calculation fits into my memory (up to 300GB).

krassowski commented 3 years ago

We could have a switch to only compute and plot the summary statistics instead of individual points, which should give you a substantial performance and memory-use improvement. I will look into it next week.

tomleung1996 commented 3 years ago

> We could have a switch to only compute and plot the summary statistics instead of individual points, which should give you a substantial performance and memory-use improvement. I will look into it next week.

Hi, Krassowski. Hope you had a nice weekend!

I came up with a possible solution to the performance problem, but I don't know if it would be too hard to implement.

In the original input file (e.g. movies), users could choose to aggregate the samples into combinations of sets themselves. This can be done by adding an extra column indicating the "weights" or "number of members" of each combination. For example:

| ID | Set1  | Set2  | Weight |
|----|-------|-------|--------|
| 1  | TRUE  | FALSE | 10     |
| 2  | TRUE  | TRUE  | 2      |
| 3  | FALSE | TRUE  | 5      |

(Rows are distinct)

In this case, users would have more control over their combinations (as well as memory usage), and ComplexUpset would only be responsible for plotting.

I am not sure whether this conflicts with the original behavior of ComplexUpset. Please ignore this comment if it involves a heavy workload. I appreciate the effort you have put into this wonderful project and all your help!
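To make the idea concrete, here is a minimal base R sketch of how the bar heights of an upset plot could be derived directly from such a weighted table, without expanding it back to one row per member. It does not use ComplexUpset (whose API, as far as this thread shows, has no weight column); the names `agg`, `intersection_sizes` and `set_sizes` are made up for the example.

```R
# The pre-aggregated table from the example above: one row per distinct
# combination, with "Weight" giving the number of members.
agg <- data.frame(
  ID     = 1:3,
  Set1   = c(TRUE, TRUE, FALSE),
  Set2   = c(FALSE, TRUE, TRUE),
  Weight = c(10, 2, 5)
)
set_cols <- c("Set1", "Set2")

# Intersection bar heights: label each combination by the sets it contains.
labels <- apply(
  agg[set_cols], 1,
  function(row) paste(set_cols[row], collapse = " & ")
)
intersection_sizes <- setNames(agg$Weight, labels)

# Set sizes: weighted column sums, so no row expansion is needed.
set_sizes <- colSums(agg[set_cols] * agg$Weight)
```

Here `intersection_sizes` holds 10, 2 and 5 for "Set1", "Set1 & Set2" and "Set2", and `set_sizes` holds 12 for Set1 and 7 for Set2. Either vector could be handed to `barplot()` or to a ggplot2 `geom_col()` layer.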

Thank you!

tomleung1996 commented 3 years ago

I finally managed to get my desired plots. If you only want to show percentages and can calculate the numbers yourself, you can generate a much smaller sample with the same distribution to get the exact same upset plot.
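One way this workaround might look, sketched in base R under the assumption that the per-combination counts are already known (the counts and the names `agg`, `reduced` and `small` are illustrative, not taken from the thread):

```R
# Pre-aggregated counts per distinct combination (values are illustrative).
agg <- data.frame(
  Set1 = c(TRUE, TRUE, FALSE),
  Set2 = c(FALSE, TRUE, TRUE),
  n    = c(5000000, 1000000, 2500000)
)
set_cols <- c("Set1", "Set2")

# Divide all counts by their greatest common divisor so that the proportions
# are preserved exactly. If the GCD is 1, no exact reduction is possible.
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)
divisor <- Reduce(gcd, agg$n)
reduced <- agg$n / divisor

# Expand back to one row per element, i.e. a wide logical membership table.
small <- agg[rep(seq_len(nrow(agg)), times = reduced), set_cols]
rownames(small) <- NULL
```

With these numbers, 8.5 million rows shrink to 17 while every percentage stays the same; `small` could then be passed to `ComplexUpset::upset(small, intersect = set_cols)`, with absolute counts recovered by multiplying by `divisor`. For real data the GCD is often 1, in which case scaling and rounding the counts gives only an approximately identical plot.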