heatmap and composite plot types do not appear to work

tehanson-UD commented 1 year ago

Hello,

I am very excited to have found this hierarchicalSets. It looks like it very efficiently solves phylogenomic clustering in an intuitive manner that even a microbiologist like me understands. The package works in my hands mostly, but not the heatmap or composite plot types. This is true for either my own data or the twitter data supplied with the package.

I've attached the input data I am working with and have pasted my R-markdown file contents below. Still love the package. I am also new to github and hope I did this correctly.

Best, Tom Hanson Phylogenomic_MATRIX.csv

title: "Phylogenomic_clustering" author: "Thomas Hanson" date: "`r Sys.Date()`" output: html_document

knitr::opts_chunk$set(echo = TRUE)

Excellent package, but I am having some small problems

I have an 18,000+ row x 10 column matrix for gene presence-absence that was produced from Anvi'o. What I want to do is use this data in hierarchicalSets to see if I can reproduce clustering that I expect. The hope was that I could produce a reasonable looking plot. Expectations: 1-the genomes would cluster into their respective phyla and genus. 2-clusters of genes would be evident for each lineage

Ultimately I would like to compare clustering by this method with clustering by metabolomics data to generate gene<->metabolomic feature relationships, particularly for genes with no predicted function.

I found hierarchicalSets and your 2017 Bioinformatics paper while banging my head against other clustering methods. The underlying idea is intuitive and appealing to a biologist like me. For this markdown, I have already installed the package.

madmatrix <- as.matrix(read.csv("Phylogenomic_MATRIX.csv", row.names = 1))
library(hierarchicalSets)
madset <- create_hierarchy(madmatrix)
madset

A HierarchicalSet object

             Universe size: 18732
            Number of sets: 10

Number of independent clusters: 1

Good, this meets expectations based on the file I put in.

plot(madset)

Excellent. The genomes are separated into their respective Phyla and the sublineages are associated correctly with Cab. validum being a deep split off the Acidobacteriota. The green and brown Chlorobi are split correctly, and the splits in the others are sensible. Branch lengths are longer for Acidobacteriota than Chlorobi reflecting the greater genetic diversity/genome size in the Acidobacteriota.

plot(madset, type = 'intersectStack', showHierarchy = TRUE)

Great! A very nice depiction of gene clusters and the number of singleton clusters that are driving the finer scale separations. The branch length difference between the two phyla is also far more apparent here.

Now for the problem

There are supposed to be 'heatmap' and 'composite' plots that would be great to see, but they do not appear to work on this data set. Here's how I generate the error:

plot(madset, type = 'heatmap', showHierarchy = TRUE)

Error in geom_rect2(): ! Problem while converting geom to grob. ℹ Error occurred in the 1st layer. Caused by error in check.length(): ! 'gpar' element 'lwd' must not be length 0 Backtrace:

base (local) <fn>(x)
ggplot2:::print.ggplot(x)
ggplot2:::ggplot_gtable.ggplot_built(data)
ggplot2:::by_layer(...)
1. ggplot2 (local) f(l = layers[[i]], d = data[[i]]) ...
2. grid::gpar(...)
3. grid:::validGP(list(...))
4. grid (local) numnotnull("lwd")
5. grid (local) check.length(gparname)
6. base::stop(...)

OK, not great, but maybe it is just my data that is weird. So, I checked it with the example data in the package

data("twitter")
twitSet <- create_hierarchy(twitter)
twitSet

A HierarchicalSet object

             Universe size: 28459
            Number of sets: 100

Number of independent clusters: 6

So far so good. Now trying intersectStack and heatmap

plot(twitSet, type = 'intersectStack', showHierarchy = TRUE)

Looks like the plot on github and no errors.

plot(twitSet, type = 'heatmap', showHierarchy = TRUE)

Error in geom_rect2(): ! Problem while converting geom to grob. ℹ Error occurred in the 1st layer. Caused by error in check.length(): ! 'gpar' element 'lwd' must not be length 0 Backtrace:

base (local) <fn>(x)
ggplot2:::print.ggplot(x)
ggplot2:::ggplot_gtable.ggplot_built(data)
ggplot2:::by_layer(...)
1. ggplot2 (local) f(l = layers[[i]], d = data[[i]]) ...
2. grid::gpar(...)
3. grid:::validGP(list(...))
4. grid (local) numnotnull("lwd")
5. grid (local) check.length(gparname)
6. base::stop(...) Error in geom_rect2(aes(xmin = y - 0.5, xmax = y + 0.5, ymin = x - 0.5, :

ℹ Error occurred in the 1st layer. Caused by error in check.length(): ! 'gpar' element 'lwd' must not be length 0 Show in New Window

OK, so it is not just my genome data, but the example data from github fails the same way. It also does not matter if showHierarchy = FALSE or I leave it off entirely.

Is the composite plot how Fig. 5 was made in your Bioinformatics paper? The composite plot type also fails the same way.

One workaround for the heatmap, and I apologize if it exists and I didn't find it, might be to export the clustering result to a matrix with an additional column for clusterID. Then a heatmap could be made using this to sort the presence/absence data in the matrix. Is this possible and sensible? That might make it easier to look for enrichments of specific COG or KO groups within clusters as these were preserved in the gene names in my case.

Here's my session information in case there is something obvious there.

Session Information

sessionInfo()

R version 4.2.3 (2023-03-15 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] hierarchicalSets_1.0.4

loaded via a namespace (and not attached): [1] Rcpp_1.0.10 pillar_1.9.0 compiler_4.2.3 RColorBrewer_1.1-3 [5] viridis_0.6.2 tools_4.2.3 digest_0.6.31 viridisLite_0.4.1 [9] evaluate_0.20 lifecycle_1.0.3 tibble_3.2.1 gtable_0.3.3
[13] lattice_0.20-45 pkgconfig_2.0.3 rlang_1.1.0 Matrix_1.5-3
[17] cli_3.6.1 rstudioapi_0.14 commonmark_1.9.0 patchwork_1.1.3
[21] yaml_2.3.7 xfun_0.38 ggdendro_0.1.23 fastmap_1.1.1
[25] gridExtra_2.3 withr_2.5.0 xml2_1.3.3 dplyr_1.1.1
[29] knitr_1.42 generics_0.1.3 vctrs_0.6.1 grid_4.2.3
[33] tidyselect_1.2.0 glue_1.6.2 R6_2.5.1 fansi_1.0.4
[37] rmarkdown_2.21 farver_2.1.1 ggplot2_3.4.2 magrittr_2.0.3
[41] scales_1.2.1 htmltools_0.5.5 MASS_7.3-58.2 colorspace_2.1-0
[45] labeling_0.4.2 utf8_1.2.3 munsell_0.5.0

Looking forward, I am collecting metabolomic data from these strains. If we can transform the metabolomic feature data to presence/absence, they can be handled exactly the same way as you state in your paper "Although Hierarchical Sets has been developed for the purpose of visualizing pangenome data, the approach is agnostic to the underlying data type, and it could potentially be applied to other large-scale set visualization problems, especially set data with a clear hierarchical interpretation." This would at least allow a visualization of whether or not the metabolomic data is mirroring the genomic data similarly as in Fig. 2 of your paper.

I am an amateur at high dimensional data analysis, so I'm not really clear on how to correlate patterns between the two matrices, but just visualizing it would be helpful. And I think this package will do that beautifully.

thomasp85 commented 1 year ago

Thanks - this package has laid dormant for some time and hasn't been updated to accommodate changes upstream - I'll take a look at it

tehanson-UD commented 1 year ago

Many thanks for the quick turnaround. It works beautifully now.

thomasp85 / hierarchicalSets