kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License

curate plot overhaul with ggplot2 #52

Closed kfuku52 closed 2 years ago

kfuku52 commented 3 years ago

Currently, curate generates diagnostic plots (heatmap, dendrogram, ...) using base R graphics together with many additional packages. At some point, this should be replaced with the ggplot2 framework for better extensibility and unified package dependencies.

https://github.com/kfuku52/amalgkit/blob/62ac91b798af63055cee842bbb3e44bc4f8d8cd8/amalgkit/transcriptome_curation.r#L4-L13

Hego-CCTB commented 3 years ago

So I've been trying to replicate our current plots with ggplot2. The analytic plots (t-SNE, PCA, histograms) are no problem, but the dendrogram and the heatmaps are giving me a headache. For reference, this is the current implementation:

[Attachment: Helianthus_annuus.0.original.pdf]

I can roughly approximate the dendrogram, but the plot-relevant dendrogram data is just a set of x and y coordinates. I'm having trouble finding an elegant way to extract the coordinates that belong to a whole clade, so that I can colour the whole clade in the tissue colour (or just the single branch for outliers).

[Image: ggplot_dendrogram_approx]

As for the heatmap, the map itself is no problem at all. The difficult part is the two bars indicating bioproject and tissue identity. I haven't tried much in that regard yet, but my best idea so far is to create two stacked barplots and add them above the heatmap. I'll post an update after I try this.

kfuku52 commented 3 years ago

Are there any annotations like "parent" or "up" in the dendrogram's input table? If so, identifying clades should not be too difficult. For heatmaps, you can use ComplexHeatmap for the moment, and remove the dependency once you come up with a good solution. https://github.com/jokergoo/ComplexHeatmap

kfuku52 commented 3 years ago

Wait...isn't ComplexHeatmap a ggplot2-based package?

kfuku52 commented 3 years ago

There seems to be a solution, though some adjustments are necessary. https://support.bioconductor.org/p/103113/

Hego-CCTB commented 3 years ago

I looked into ComplexHeatmap before, but it looked like it's its own thing, separate from ggplot2, so I didn't pursue that thought. ComplexHeatmap seems like a good solution, though!

It would even be possible to combine heatmap and dendrogram with that (although that kind of graph would probably end up too cluttered).

As for the dendrogram, for the ggplot above I use the ggdendro package. I feed an hclust object into a function called dendro_data, which converts the hclust object into something that's readable by ggplot.

The resulting dendro_data table only contains x and y coordinates, so no information on branch/node relationships is retained. I use geom_segment() to basically just draw lines between the various x and y points. I can add labels to the leaves by their y coordinate (always 0). Tracing back to the various branching nodes should theoretically be possible, but it would be really complicated.
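One possible way around the missing node relationships, sketched here with base R only: the hclust object itself still stores the tree in its merge matrix, where a negative entry is a single leaf and a positive entry points to an earlier row (an earlier merge). The toy data, the sample and tissue names, and the helper functions clade_leaves() and clade_tissue() below are hypothetical and are not part of amalgkit or ggdendro; how the resulting labels would be matched to the plotted segments is left open.

```r
# Toy data: 6 samples from 2 made-up tissues (illustrative only).
set.seed(1)
expr <- rbind(
  matrix(rnorm(15, mean = 0), nrow = 3),
  matrix(rnorm(15, mean = 10), nrow = 3)
)
rownames(expr) <- paste0("sample", 1:6)
tissue <- c(rep("leaf", 3), rep("root", 3))
names(tissue) <- rownames(expr)

hc <- hclust(dist(expr))

# Hypothetical helper: for every internal node (one per row of hc$merge),
# collect the indices of all leaves below it.
clade_leaves <- function(hc) {
  n_nodes <- nrow(hc$merge)
  leaves <- vector("list", n_nodes)
  for (i in seq_len(n_nodes)) {
    members <- integer(0)
    for (child in hc$merge[i, ]) {
      if (child < 0) {
        members <- c(members, -child)           # negative entry: a single leaf
      } else {
        members <- c(members, leaves[[child]])  # positive entry: an earlier merge
      }
    }
    leaves[[i]] <- members
  }
  leaves
}

# Hypothetical helper: a node gets a tissue label only if all of its
# leaves share that tissue; otherwise it is marked "mixed".
clade_tissue <- function(hc, tissue_by_label) {
  vapply(clade_leaves(hc), function(idx) {
    tissues <- unique(tissue_by_label[hc$labels[idx]])
    if (length(tissues) == 1) tissues else "mixed"
  }, character(1))
}

node_info <- data.frame(
  node = seq_len(nrow(hc$merge)),
  height = hc$height,
  tissue = clade_tissue(hc, tissue)
)
print(node_info)
```

Each row of node_info describes one branching node: its merge height and either the tissue shared by the whole clade or "mixed". A colour lookup keyed on the tissue column could then drive the branch colours, with "mixed" nodes falling back to a neutral colour.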

kfuku52 commented 3 years ago

ggtree would be a good alternative, but things seem to be getting more complicated. ggplot2 is more extensible, but it seems that the package dependencies would not actually get any simpler. We should probably keep the old plotting script until an extension is needed. What do you think?

Hego-CCTB commented 3 years ago

I'll keep a lookout for alternative packages, but I agree. We don't need to fix what's not broken for the moment.

kfuku52 commented 3 years ago

Sounds good!