krassowski / complex-upset

A library for creating complex UpSet plots with ggplot2 geoms
MIT License
469 stars 28 forks source link

`arrange_venn` changes order of rows #116

Closed maximilian-heeg closed 2 years ago

maximilian-heeg commented 3 years ago

Describe the bug Hi,

thanks for that amazing package. I noticed a bug when I tried to create a Venn Diagram.

The order of the rows after calling arrange_venn is different from the order of the original data. Here is a minimal example to illustrate the issue.

Thank you. Max

Code to reproduce

library(tidyverse)
library(ComplexUpset)

df = tibble(
  name = 1:20,
  A = 1:20 %% 2 == 0,
  B = 1:20 %% 3 == 0,
  C = 1:20 %% 5 == 0
)

head(df)

# this column is to track which rows have had TRUE in A
df$WasA = df$A

sets = c("A", "B", "C")
arr = arrange_venn(df, sets)

# the order of the names (numbers) is still the same
all(arr$name == df$name)
# TRUE

# but the values in "A" do not match any longer
all(arr$A == df$A)
# FALSE

ggplot(arr) +
  theme_void() +
  geom_venn_circle(data=df, sets=sets, size=1)+ 
  geom_venn_label_set(df, sets=sets, aes(label=region), outwards_adjust=2.6) + 
  geom_point(aes(x=x, y=y, color=WasA), size=3)

# All green points should be in circle A

image

If you arrange the df first, the function works as expected:

df2 = df
df2$sum = df2$A + df2$B + df2$C
df2 = df2 %>% 
  arrange(sum, desc(A), desc(B), desc(C))

arr2 = arrange_venn(df2, sets)
ggplot(arr2) +
  theme_void() +
  geom_venn_circle(data=df2, sets=sets, size=1)+ 
  geom_venn_label_set(df2, sets=sets, aes(label=region), outwards_adjust=2.6) + 
  geom_point(aes(x=x, y=y, color=WasA), size=3)

image

Context (required)

packageVersion('ComplexUpset') [1] ‘1.2.0’

R version details ```R $platform [1] "x86_64-pc-linux-gnu" $arch [1] "x86_64" $os [1] "linux-gnu" $system [1] "x86_64, linux-gnu" $status [1] "" $major [1] "4" $minor [1] "0.5" $year [1] "2021" $month [1] "03" $day [1] "31" $`svn rev` [1] "80133" $language [1] "R" $version.string [1] "R version 4.0.5 (2021-03-31)" $nickname [1] "Shake and Throw" ```
R session information ```R R version 4.0.5 (2021-03-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Manjaro Linux Matrix products: default BLAS: /usr/lib/libopenblasp-r0.3.13.so LAPACK: /usr/lib/liblapack.so.3.9.1 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices datasets utils methods base other attached packages: [1] ComplexUpset_1.2.0 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.5 purrr_0.3.4 readr_1.4.0 tidyr_1.1.3 [8] tibble_3.1.1 ggplot2_3.3.3 tidyverse_1.3.1 loaded via a namespace (and not attached): [1] bitops_1.0-6 matrixStats_0.58.0 fs_1.5.0 lubridate_1.7.10 bit64_4.0.5 [6] RColorBrewer_1.1-2 httr_1.4.2 GenomeInfoDb_1.26.7 tools_4.0.5 backports_1.2.1 [11] utf8_1.2.1 R6_2.5.0 DBI_1.1.1 BiocGenerics_0.36.1 lazyeval_0.2.2 [16] colorspace_2.0-0 withr_2.4.2 tidyselect_1.1.0 DESeq2_1.30.1 bit_4.0.4 [21] compiler_4.0.5 cli_2.4.0 rvest_1.0.0 Biobase_2.50.0 xml2_1.3.2 [26] DelayedArray_0.16.3 plotly_4.9.3 labeling_0.4.2 scales_1.1.1 genefilter_1.72.1 [31] digest_0.6.27 XVector_0.30.0 pkgconfig_2.0.3 htmltools_0.5.1.1 MatrixGenerics_1.2.1 [36] dbplyr_2.1.1 fastmap_1.1.0 htmlwidgets_1.5.3 rlang_0.4.10 readxl_1.3.1 [41] rstudioapi_0.13 RSQLite_2.2.6 farver_2.1.0 generics_0.1.0 jsonlite_1.7.2 [46] BiocParallel_1.24.1 RCurl_1.98-1.3 magrittr_2.0.1 GenomeInfoDbData_1.2.4 patchwork_1.1.1 [51] Matrix_1.3-2 Rcpp_1.0.6 munsell_0.5.0 S4Vectors_0.28.1 fansi_0.4.2 [56] lifecycle_1.0.0 stringi_1.5.3 SummarizedExperiment_1.20.0 zlibbioc_1.36.0 grid_4.0.5 [61] blob_1.2.1 parallel_4.0.5 crayon_1.4.1 lattice_0.20-41 haven_2.4.0 [66] splines_4.0.5 annotate_1.68.0 hms_1.0.0 locfit_1.5-9.4 pillar_1.6.0 [71] GenomicRanges_1.42.0 geneplotter_1.68.0 stats4_4.0.5 reprex_2.0.0 XML_3.99-0.6 [76] glue_1.4.2 data.table_1.14.0 renv_0.12.0 BiocManager_1.30.12 modelr_0.1.8 [81] vctrs_0.3.7 cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1 cachem_1.0.4 [86] xtable_1.8-4 broom_0.7.6 survival_3.2-10 viridisLite_0.4.0 AnnotationDbi_1.52.0 [91] memoise_2.0.0 IRanges_2.24.1 ellipsis_0.3.1 ```
krassowski commented 3 years ago

Thank you for describing the issue in detail and great reproducible examples.

Indeed, the "metadata" columns (all columns that do not define the set) were not getting re-ordered to match the order of the newly arranged data points. To elaborate, getting all(arr$A == df$A) == FALSE is expected but all(arr$name == df$name) should not return TRUE in such a case. I hope that re-shuffling rows does not cause any issues (other than the reported bug).

I have now prepared a fix in https://github.com/krassowski/complex-upset/pull/117 and will release a new version 1.2.1 immediately once all tests pass on CI. Please let me know if you encounter any other issues. Thanks again!

maximilian-heeg commented 2 years ago

Hi,

I noticed today, that the issue is only partially fixed. The fix depends on matching the rownames and works great if a data.frame is used. If, however a tibble is used, it fails (I assume this has to do with the way tibbles handle rownames).

Again, here is a minimal example to illustrate the issue:

library(tidyverse)
library(ComplexUpset)

df = tibble(
  name = 1:20,
  A = 1:20 %% 2 == 0,
  B = 1:20 %% 3 == 0,
  C = 1:20 %% 5 == 0
)

# this column is to track which rows have had TRUE in A
df$WasA = df$A

sets = c("A", "B", "C")
arr = arrange_venn(df, sets)

ggplot(arr) +
  theme_void() +
  geom_venn_circle(data=df, sets=sets, size=1)+ 
  geom_venn_label_set(df, sets=sets, aes(label=region), outwards_adjust=2.6) + 
  geom_point(aes(x=x, y=y, color=WasA), size=3)

# All green points should be in circle A, but this is not the case

# if we convert the tibble to a data.frame, everything works as expected.
df = as.data.frame(df)
arr = arrange_venn(df, sets)

ggplot(arr) +
  theme_void() +
  geom_venn_circle(data=df, sets=sets, size=1)+ 
  geom_venn_label_set(df, sets=sets, aes(label=region), outwards_adjust=2.6) + 
  geom_point(aes(x=x, y=y, color=WasA), size=3)

# All green points are in circle A

So I guess the easiest fix would be to make sure that tibbles are converted to data.frames before calling arrange_venn.

Max

krassowski commented 2 years ago

So I guess the easiest fix would be to make sure that tibbles are converted to data.frames before calling arrange_venn.

I agree. Would you like to open a PR with the fix and the test case that you demonstrate?