thackl / gggenomes

A grammar of graphics for comparative genomics
https://thackl.github.io/gggenomes/
Other
581 stars 64 forks source link

Easyfig or mauve outputs? #82

Closed cms72 closed 2 years ago

cms72 commented 2 years ago

Hi! Is there a way to incorporate outputs from Easyfig and Mauve? Thanks!

thackl commented 2 years ago

Not yet, but both are on my mental list of things that would be nice to have. Do you have some example data that you'd want to visualize? That might help me implementing such features faster!

cms72 commented 2 years ago

Hi! Thanks for the reply. I do! I'm looking at pathogenicity islands - there are 24 in total - and was looking to compare all of them.

Archive.zip

I've compared 2 pathogenicity islands using genoplotR, but I'm quite limited with what I can do - so I'm excited to try your tool.

thackl commented 2 years ago

Great, thanks for sharing. My understanding of the easyfig output is that it is essentially just an output form all-vs-all nucleotide blast comparisons. That is very easy to import into gggenomes. Note, though, that for the two example sequences you sent, there is barely any significant similarity between the sequence on nucleotide level, so the visualization, although working, is not too informative.

library(tidyverse)
library(gggenomes)

# NOTE: had to remove the "()" from the filenames for this to work
s0 <- read_seqs(list.files(pattern="*.gb"))
g0 <- read_feats(list.files(pattern="*.gb"))

# this is just blastn --outfmt 6
e0 <- read_links("12.easyfig.out", format="blast")
# need to match ids - easyfig uses filename instead of the seq_id
e0$seq_id <- s0$seq_id[1]
e0$seq_id2 <- s0$seq_id[2]

gggenomes(g0, s0, links=e0) +
  geom_seq() + geom_gene() +
  geom_link()

82-1

At least for this case with remotely related elements, you get better insights from protein-level comparisons. Those are also quite easy to visualize in gggenomes.

# all-vs-all protein blast
# blastp -query SPI-1_NC_003198_.faa -subject SPI-2_NC_003198_.faa -outfmt 6 -evalue 1e-5 > spi.o6
b0 <- read_sublinks("spi.o6") # sublinks - links connecting features, such as genes, not contigs directly
# overwrite default feat_id parsed from locus_tag with protein_id, which is used
# in protein.faa
g1 <- g0 %>% mutate(feat_id = protein_id) 

gggenomes(g1, s0) %>%
  add_sublinks(blast=b0) +
  geom_seq() + geom_gene() + geom_link()

82-2

For larger datasets, I'd alternatively consider doing/visualizing protein clusters instead of just blast hits. Could be done like so

# mmseqs easy-cluster *.faa spi /tmp -e 1e-5 -c 0.7
c0 <- read_tsv("spi_cluster.tsv", col_names=c("cluster_id", "feat_id"))
c1 <- c0 %>% add_count(cluster_id) %>% filter(n>1) # remove singletons

gggenomes(g1, s0) %>%
  add_sublinks(blast=b0) %>%
  add_clusters(mmseqs=c1) +
  geom_seq() +
  geom_gene(aes(fill=cluster_id)) + # color genes by clusters
  geom_link() + # blast links
  # clusters can also be shown as links (overlaying on blast here)
  geom_link(aes(color=cluster_id), data=links(mmseqs), fill=NA)

82-3

There is also quite a few options for further manipulating your plot - flipping sequences, zooming in etc - that make more sense on larger data set. Happy to talk more if you want to explore this further, also directly via email if that would make it easier for you to share data.

cms72 commented 2 years ago

Hi! This looks really great! Thank you!! I will try it out! Would it be the same for mauve outputs (see below)?

Archive.zip

thackl commented 2 years ago

Mauve output is a bit more complex. And it depends on what parts of it you are interested / how similar to Mauve's Viewer output you want the result. Below is some simple code that produces a gggenomes plot from mauve output - but it's just quick and dirty with minimal information. As said in the beginning, for better code it depends a bit on how you envision the plot to look like and what information you want it to convey. I'd be curious about your thoughts!

FYI, I took a different example (3 bacterial genomes) because your sequences share so little similarity at nucleotide level that the plot looked strange even in Mauve ;)

Mauve image gggenomes image

library(tidyverse)
library(gggenomes)

# convert Mauve main output file in a parsible table with some perl magic
# perl -ane 'BEGIN{$i=1; print "seq_id\tstart\tend\tstrand\tfeat_id\tcluster_id\n"} $i++ if /^=/; if(/^>\s?(\d+):(\d+)-(\d+)\s(.)/){print "s$1\t$2\t$3\t$4\tf$1-$i\tc$i\n"}' mauve-out > mauve-out.tsv

# read blocks as features
mauve_blocks <- read_tsv("mauve-pro/hl.tsv") %>%
  add_count(cluster_id) %>%
  mutate(cluster_id = ifelse(n>1, cluster_id, NA))

# blocks are clustered - easiest way to add links
mauve_clusters <- select(s1, cluster_id, feat_id) %>%
  drop_na()

gggenomes(feats=mauve_blocks) %>%
  add_clusters(mauve_clusters, .track_id = feats) +
  geom_feat(aes(color=cluster_id), position="strand") +
  geom_link(aes(fill=cluster_id))
cms72 commented 2 years ago

Wow. Thank you! I'll give it a try and ill let you know how I go :)

cms72 commented 2 years ago

Hi! So I havent gotten that far - when I run this g0 <- read_feats(list.files(pattern="*.gb")), I get this Error in map.poly(database, regions, exact, xlim, ylim, boundary, interior, : no recognized region names

image

thackl commented 2 years ago

This looks like the wrong map() function is called inside read_gbk() (or actually read_gff()). It looks like map.poly from https://www.rdocumentation.org/packages/maps/versions/3.4.0/topics/map, but it should use purrr::map(). You probably loaded the maps package in your environment after purrr/tidyverse (or didn't load purrr/tidyverse at all). Try loading those packages again such that purrr::map takes priority over maps::map when just calling map(). Then read_feats should work..

I'll update the code so this won't happen in future releases.

thackl commented 2 years ago

Or you could also try to just set map <- purrr::map()

cms72 commented 2 years ago

This looks like the wrong map() function is called inside read_gbk() (or actually read_gff()). It looks like map.poly from https://www.rdocumentation.org/packages/maps/versions/3.4.0/topics/map, but it should use purrr::map(). You probably loaded the maps package in your environment after purrr/tidyverse (or didn't load purrr/tidyverse at all). Try loading those packages again such that purrr::map takes priority over maps::map when just calling map(). Then read_feats should work..

I'll update the code so this won't happen in future releases.

This worked! I had restarted R, and called library purrr and library tidyverse first, before library gggenomes