Closed cms72 closed 2 years ago
Not yet, but both are on my mental list of things that would be nice to have. Do you have some example data that you'd want to visualize? That might help me implementing such features faster!
Hi! Thanks for the reply. I do! I'm looking at pathogenicity islands - there are 24 in total - and was looking to compare all of them.
I've compared 2 pathogenicity islands using genoplotR, but I'm quite limited with what I can do - so I'm excited to try your tool.
Great, thanks for sharing. My understanding of the easyfig output is that it is essentially just an output form all-vs-all nucleotide blast comparisons. That is very easy to import into gggenomes. Note, though, that for the two example sequences you sent, there is barely any significant similarity between the sequence on nucleotide level, so the visualization, although working, is not too informative.
library(tidyverse)
library(gggenomes)
# NOTE: had to remove the "()" from the filenames for this to work
s0 <- read_seqs(list.files(pattern="*.gb"))
g0 <- read_feats(list.files(pattern="*.gb"))
# this is just blastn --outfmt 6
e0 <- read_links("12.easyfig.out", format="blast")
# need to match ids - easyfig uses filename instead of the seq_id
e0$seq_id <- s0$seq_id[1]
e0$seq_id2 <- s0$seq_id[2]
gggenomes(g0, s0, links=e0) +
geom_seq() + geom_gene() +
geom_link()
At least for this case with remotely related elements, you get better insights from protein-level comparisons. Those are also quite easy to visualize in gggenomes.
# all-vs-all protein blast
# blastp -query SPI-1_NC_003198_.faa -subject SPI-2_NC_003198_.faa -outfmt 6 -evalue 1e-5 > spi.o6
b0 <- read_sublinks("spi.o6") # sublinks - links connecting features, such as genes, not contigs directly
# overwrite default feat_id parsed from locus_tag with protein_id, which is used
# in protein.faa
g1 <- g0 %>% mutate(feat_id = protein_id)
gggenomes(g1, s0) %>%
add_sublinks(blast=b0) +
geom_seq() + geom_gene() + geom_link()
For larger datasets, I'd alternatively consider doing/visualizing protein clusters instead of just blast hits. Could be done like so
# mmseqs easy-cluster *.faa spi /tmp -e 1e-5 -c 0.7
c0 <- read_tsv("spi_cluster.tsv", col_names=c("cluster_id", "feat_id"))
c1 <- c0 %>% add_count(cluster_id) %>% filter(n>1) # remove singletons
gggenomes(g1, s0) %>%
add_sublinks(blast=b0) %>%
add_clusters(mmseqs=c1) +
geom_seq() +
geom_gene(aes(fill=cluster_id)) + # color genes by clusters
geom_link() + # blast links
# clusters can also be shown as links (overlaying on blast here)
geom_link(aes(color=cluster_id), data=links(mmseqs), fill=NA)
There is also quite a few options for further manipulating your plot - flipping sequences, zooming in etc - that make more sense on larger data set. Happy to talk more if you want to explore this further, also directly via email if that would make it easier for you to share data.
Hi! This looks really great! Thank you!! I will try it out! Would it be the same for mauve outputs (see below)?
Mauve output is a bit more complex. And it depends on what parts of it you are interested / how similar to Mauve's Viewer output you want the result. Below is some simple code that produces a gggenomes plot from mauve output - but it's just quick and dirty with minimal information. As said in the beginning, for better code it depends a bit on how you envision the plot to look like and what information you want it to convey. I'd be curious about your thoughts!
FYI, I took a different example (3 bacterial genomes) because your sequences share so little similarity at nucleotide level that the plot looked strange even in Mauve ;)
Mauve gggenomes
library(tidyverse)
library(gggenomes)
# convert Mauve main output file in a parsible table with some perl magic
# perl -ane 'BEGIN{$i=1; print "seq_id\tstart\tend\tstrand\tfeat_id\tcluster_id\n"} $i++ if /^=/; if(/^>\s?(\d+):(\d+)-(\d+)\s(.)/){print "s$1\t$2\t$3\t$4\tf$1-$i\tc$i\n"}' mauve-out > mauve-out.tsv
# read blocks as features
mauve_blocks <- read_tsv("mauve-pro/hl.tsv") %>%
add_count(cluster_id) %>%
mutate(cluster_id = ifelse(n>1, cluster_id, NA))
# blocks are clustered - easiest way to add links
mauve_clusters <- select(s1, cluster_id, feat_id) %>%
drop_na()
gggenomes(feats=mauve_blocks) %>%
add_clusters(mauve_clusters, .track_id = feats) +
geom_feat(aes(color=cluster_id), position="strand") +
geom_link(aes(fill=cluster_id))
Wow. Thank you! I'll give it a try and ill let you know how I go :)
Hi! So I havent gotten that far - when I run this g0 <- read_feats(list.files(pattern="*.gb"))
, I get this Error in map.poly(database, regions, exact, xlim, ylim, boundary, interior, : no recognized region names
This looks like the wrong map()
function is called inside read_gbk()
(or actually read_gff()
). It looks like map.poly
from https://www.rdocumentation.org/packages/maps/versions/3.4.0/topics/map, but it should use purrr::map()
. You probably loaded the maps package in your environment after purrr/tidyverse (or didn't load purrr/tidyverse at all). Try loading those packages again such that purrr::map
takes priority over maps::map
when just calling map()
. Then read_feats
should work..
I'll update the code so this won't happen in future releases.
Or you could also try to just set map <- purrr::map()
This looks like the wrong
map()
function is called insideread_gbk()
(or actuallyread_gff()
). It looks likemap.poly
from https://www.rdocumentation.org/packages/maps/versions/3.4.0/topics/map, but it should usepurrr::map()
. You probably loaded the maps package in your environment after purrr/tidyverse (or didn't load purrr/tidyverse at all). Try loading those packages again such thatpurrr::map
takes priority overmaps::map
when just callingmap()
. Thenread_feats
should work..I'll update the code so this won't happen in future releases.
This worked! I had restarted R, and called library purrr
and library tidyverse
first, before library gggenomes
Hi! Is there a way to incorporate outputs from Easyfig and Mauve? Thanks!