thackl / gggenomes

A grammar of graphics for comparative genomics
https://thackl.github.io/gggenomes/
Other
606 stars 65 forks source link

pangenome: tree, gene presence/absence with gggenome like representation #202

Open tahsinkhann opened 1 month ago

tahsinkhann commented 1 month ago

Hi, I want to a create pangenome plot for bacteriophages with phylogenetic tree (left side) + gene presence/absence (right side) + gggenome plot that has all ORFs/CDS (in top of presence/absence plot)

I have a tree in nwk format, gene_presence_absence.csv from roary/panaroo. All I need is to combine these with a gggenome like pangenome representation in top of gene_presence_absence matrix.

Lastly, thanks for this amaizing gggenomes tool, it benifited me highly.

thackl commented 1 month ago

The easiest way is probably to use https://yulab-smu.top/pkgdocs/aplot.html. Check out section 4.2.2 Aligning plots with a tree. Does that help?

tahsinkhann commented 1 month ago

Thanks for the response. The link was useful, learned new concepts. However, is it possible to generate something like the draft figure I attached. tree_gggenomes_pangenome

in the figure a) tree b) gene_presence_absence.csv from roary/panaroo c) all CDS in a pangenome generated by gggenomes (???)

basically I am thinking of an alternative of the attached circular pangenome figure (I don't know how the authors made this figure, nothing was mentioned about the methods in theri paper) image

Thanks

thackl commented 1 month ago

OK, here's some code that I think goes into the direction that you are looking for, and obviously with room to make it prettier. That said, I think there some conceptual limitations here. All you are showing is relative to the picked reference genome, i.e. it does not show any genes that are not present in the reference and it does not account for rearrangements...

library(tidyverse)
library(ggtree)
library(gggenomes)
library(aplot)

# get a random tree
library(ape)
set.seed(2017)
tree <- rtree(4, tip.label = LETTERS[1:4])
p_tree <- ggtree(tree) + geom_tiplab()
p_tree

image

# get some random gene coords
genes <- filter(emale_genes, seq_id == "Cflag_017B") |> 
  transmute(seq_id = "A", start, end, feat_id, note=Note)

# reference genes - NOTE: any gene that is not present in the reference will not be shown in the plot!
p_genes <- gggenomes(genes, infer_start = 0) +
  geom_gene(aes(fill=note)) + geom_bin_label()
p_genes

image

# get random presence/absence for all reference genes and all genomes
presence <- map(LETTERS[1:4], function(id){
  tibble(
    seq_id = id,
    feat_id = genes$feat_id,
    presence = sample(0:1, length(genes$feat_id), replace = T, prob=c(0.2, 0.8)))
}) |> bind_rows()

ggplot(presence) + geom_raster(aes(feat_id, seq_id, fill=as.logical(presence)))

image

# we need to merge p/a with gene start/end and establish the same sorting order as in tree
tree_label_order <- p_tree$data |> filter(isTip) |> arrange(, y) |> pull(label)
genes_presence <- presence |> 
  left_join(select(genes, -seq_id)) |>  # ignore reference seq_id ("A") to create records for "A-D"
  mutate(
    width = width(start, end),
    seq_id = factor(seq_id, levels=tree_label_order)) # reorder based on tree

# use geom_tile over geom_raster as it allows boxes of different sizes
p_presence <- ggplot(genes_presence) +
  geom_tile(aes(width = width, height = .9, x = (start+end)/2, y = seq_id, fill=as.logical(presence)))
p_presence

image

# combine plots
p_presence |> insert_left(p_tree, width=.2) |> insert_top(p_genes, height=.2)

image