I am very excited to have found this hierarchicalSets. It looks like it very efficiently solves phylogenomic clustering in an intuitive manner that even a microbiologist like me understands. The package works in my hands mostly, but not the heatmap or composite plot types. This is true for either my own data or the twitter data supplied with the package.
I've attached the input data I am working with and have pasted my R-markdown file contents below. Still love the package. I am also new to github and hope I did this correctly.
Excellent package, but I am having some small problems
I have an 18,000+ row x 10 column matrix for gene presence-absence that was produced from Anvi'o. What I want to do is use this data in hierarchicalSets to see if I can reproduce clustering that I expect. The hope was that I could produce a reasonable looking plot. Expectations:
1-the genomes would cluster into their respective phyla and genus.
2-clusters of genes would be evident for each lineage
Ultimately I would like to compare clustering by this method with clustering by metabolomics data to generate gene<->metabolomic feature relationships, particularly for genes with no predicted function.
I found hierarchicalSets and your 2017 Bioinformatics paper while banging my head against other clustering methods. The underlying idea is intuitive and appealing to a biologist like me. For this markdown, I have already installed the package.
Good, this meets expectations based on the file I put in.
plot(madset)
Excellent. The genomes are separated into their respective Phyla and the sublineages are associated correctly with Cab. validum being a deep split off the Acidobacteriota. The green and brown Chlorobi are split correctly, and the splits in the others are sensible. Branch lengths are longer for Acidobacteriota than Chlorobi reflecting the greater genetic diversity/genome size in the Acidobacteriota.
plot(madset, type = 'intersectStack', showHierarchy = TRUE)
Great! A very nice depiction of gene clusters and the number of singleton clusters that are driving the finer scale separations. The branch length difference between the two phyla is also far more apparent here.
Now for the problem
There are supposed to be 'heatmap' and 'composite' plots that would be great to see, but they do not appear to work on this data set. Here's how I generate the error:
plot(madset, type = 'heatmap', showHierarchy = TRUE)
Error in geom_rect2():
! Problem while converting geom to grob.
ℹ Error occurred in the 1st layer.
Caused by error in check.length():
! 'gpar' element 'lwd' must not be length 0
Backtrace:
base (local) <fn>(x)
ggplot2:::print.ggplot(x)
ggplot2:::ggplot_gtable.ggplot_built(data)
ggplot2:::by_layer(...)
ggplot2 (local) f(l = layers[[i]], d = data[[i]])
...
grid::gpar(...)
grid:::validGP(list(...))
grid (local) numnotnull("lwd")
grid (local) check.length(gparname)
base::stop(...)
OK, not great, but maybe it is just my data that is weird. So, I checked it with the example data in the package
So far so good. Now trying intersectStack and heatmap
plot(twitSet, type = 'intersectStack', showHierarchy = TRUE)
Looks like the plot on github and no errors.
plot(twitSet, type = 'heatmap', showHierarchy = TRUE)
Error in geom_rect2():
! Problem while converting geom to grob.
ℹ Error occurred in the 1st layer.
Caused by error in check.length():
! 'gpar' element 'lwd' must not be length 0
Backtrace:
base (local) <fn>(x)
ggplot2:::print.ggplot(x)
ggplot2:::ggplot_gtable.ggplot_built(data)
ggplot2:::by_layer(...)
ggplot2 (local) f(l = layers[[i]], d = data[[i]])
...
grid::gpar(...)
grid:::validGP(list(...))
grid (local) numnotnull("lwd")
grid (local) check.length(gparname)
base::stop(...)
Error in geom_rect2(aes(xmin = y - 0.5, xmax = y + 0.5, ymin = x - 0.5, :
ℹ Error occurred in the 1st layer.
Caused by error in check.length():
! 'gpar' element 'lwd' must not be length 0
Show in New Window
OK, so it is not just my genome data, but the example data from github fails the same way. It also does not matter if showHierarchy = FALSE or I leave it off entirely.
Is the composite plot how Fig. 5 was made in your Bioinformatics paper? The composite plot type also fails the same way.
One workaround for the heatmap, and I apologize if it exists and I didn't find it, might be to export the clustering result to a matrix with an additional column for clusterID. Then a heatmap could be made using this to sort the presence/absence data in the matrix. Is this possible and sensible? That might make it easier to look for enrichments of specific COG or KO groups within clusters as these were preserved in the gene names in my case.
Here's my session information in case there is something obvious there.
Session Information
sessionInfo()
R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)
Looking forward, I am collecting metabolomic data from these strains. If we can transform the metabolomic feature data to presence/absence, they can be handled exactly the same way as you state in your paper "Although Hierarchical Sets has been developed for the purpose of visualizing pangenome data, the approach is agnostic to the underlying data type, and it could potentially be applied to other large-scale set visualization problems, especially set data with a clear hierarchical interpretation." This would at least allow a visualization of whether or not the metabolomic data is mirroring the genomic data similarly as in Fig. 2 of your paper.
I am an amateur at high dimensional data analysis, so I'm not really clear on how to correlate patterns between the two matrices, but just visualizing it would be helpful. And I think this package will do that beautifully.
Hello,
I am very excited to have found this hierarchicalSets. It looks like it very efficiently solves phylogenomic clustering in an intuitive manner that even a microbiologist like me understands. The package works in my hands mostly, but not the heatmap or composite plot types. This is true for either my own data or the twitter data supplied with the package.
I've attached the input data I am working with and have pasted my R-markdown file contents below. Still love the package. I am also new to github and hope I did this correctly.
Best, Tom Hanson Phylogenomic_MATRIX.csv
title: "Phylogenomic_clustering" author: "Thomas Hanson" date: "
r Sys.Date()
" output: html_documentExcellent package, but I am having some small problems
I have an 18,000+ row x 10 column matrix for gene presence-absence that was produced from Anvi'o. What I want to do is use this data in hierarchicalSets to see if I can reproduce clustering that I expect. The hope was that I could produce a reasonable looking plot. Expectations: 1-the genomes would cluster into their respective phyla and genus. 2-clusters of genes would be evident for each lineage
Ultimately I would like to compare clustering by this method with clustering by metabolomics data to generate gene<->metabolomic feature relationships, particularly for genes with no predicted function.
I found hierarchicalSets and your 2017 Bioinformatics paper while banging my head against other clustering methods. The underlying idea is intuitive and appealing to a biologist like me. For this markdown, I have already installed the package.
A HierarchicalSet object
Number of independent clusters: 1
Good, this meets expectations based on the file I put in.
Excellent. The genomes are separated into their respective Phyla and the sublineages are associated correctly with Cab. validum being a deep split off the Acidobacteriota. The green and brown Chlorobi are split correctly, and the splits in the others are sensible. Branch lengths are longer for Acidobacteriota than Chlorobi reflecting the greater genetic diversity/genome size in the Acidobacteriota.
Great! A very nice depiction of gene clusters and the number of singleton clusters that are driving the finer scale separations. The branch length difference between the two phyla is also far more apparent here.
Now for the problem
There are supposed to be 'heatmap' and 'composite' plots that would be great to see, but they do not appear to work on this data set. Here's how I generate the error:
Error in
geom_rect2()
: ! Problem while converting geom to grob. ℹ Error occurred in the 1st layer. Caused by error incheck.length()
: ! 'gpar' element 'lwd' must not be length 0 Backtrace:<fn>
(x)OK, not great, but maybe it is just my data that is weird. So, I checked it with the example data in the package
A HierarchicalSet object
Number of independent clusters: 6
So far so good. Now trying intersectStack and heatmap
Looks like the plot on github and no errors.
Error in
geom_rect2()
: ! Problem while converting geom to grob. ℹ Error occurred in the 1st layer. Caused by error incheck.length()
: ! 'gpar' element 'lwd' must not be length 0 Backtrace:<fn>
(x)ℹ Error occurred in the 1st layer. Caused by error in
check.length()
: ! 'gpar' element 'lwd' must not be length 0 Show in New WindowOK, so it is not just my genome data, but the example data from github fails the same way. It also does not matter if showHierarchy = FALSE or I leave it off entirely.
Is the composite plot how Fig. 5 was made in your Bioinformatics paper? The composite plot type also fails the same way.
One workaround for the heatmap, and I apologize if it exists and I didn't find it, might be to export the clustering result to a matrix with an additional column for clusterID. Then a heatmap could be made using this to sort the presence/absence data in the matrix. Is this possible and sensible? That might make it easier to look for enrichments of specific COG or KO groups within clusters as these were preserved in the gene names in my case.
Here's my session information in case there is something obvious there.
Session Information
R version 4.2.3 (2023-03-15 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 22000)
Matrix products: default
locale: [1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] hierarchicalSets_1.0.4
loaded via a namespace (and not attached): [1] Rcpp_1.0.10 pillar_1.9.0 compiler_4.2.3 RColorBrewer_1.1-3 [5] viridis_0.6.2 tools_4.2.3 digest_0.6.31 viridisLite_0.4.1 [9] evaluate_0.20 lifecycle_1.0.3 tibble_3.2.1 gtable_0.3.3
[13] lattice_0.20-45 pkgconfig_2.0.3 rlang_1.1.0 Matrix_1.5-3
[17] cli_3.6.1 rstudioapi_0.14 commonmark_1.9.0 patchwork_1.1.3
[21] yaml_2.3.7 xfun_0.38 ggdendro_0.1.23 fastmap_1.1.1
[25] gridExtra_2.3 withr_2.5.0 xml2_1.3.3 dplyr_1.1.1
[29] knitr_1.42 generics_0.1.3 vctrs_0.6.1 grid_4.2.3
[33] tidyselect_1.2.0 glue_1.6.2 R6_2.5.1 fansi_1.0.4
[37] rmarkdown_2.21 farver_2.1.1 ggplot2_3.4.2 magrittr_2.0.3
[41] scales_1.2.1 htmltools_0.5.5 MASS_7.3-58.2 colorspace_2.1-0
[45] labeling_0.4.2 utf8_1.2.3 munsell_0.5.0
Looking forward, I am collecting metabolomic data from these strains. If we can transform the metabolomic feature data to presence/absence, they can be handled exactly the same way as you state in your paper "Although Hierarchical Sets has been developed for the purpose of visualizing pangenome data, the approach is agnostic to the underlying data type, and it could potentially be applied to other large-scale set visualization problems, especially set data with a clear hierarchical interpretation." This would at least allow a visualization of whether or not the metabolomic data is mirroring the genomic data similarly as in Fig. 2 of your paper.
I am an amateur at high dimensional data analysis, so I'm not really clear on how to correlate patterns between the two matrices, but just visualizing it would be helpful. And I think this package will do that beautifully.