thackl / gggenomes

A grammar of graphics for comparative genomics
https://thackl.github.io/gggenomes/
Other
581 stars 64 forks source link

geneious data input issues #93

Closed tardl closed 2 years ago

tardl commented 2 years ago

Hello, I'm importing annotated data from geneious and so far failed to generate a graph.I currently don't know if some informations are lacking from my input files as I received no error messages when trying. The first gff file in the folder is used for the seqs and the second for the genes annotation. The aim is to show the conservation of the genes in an alignment and the inversion of one of them. Thanks a lot! geneious_data.zip

thackl commented 2 years ago

Hi,

TL;DR There's an easy fix, change the type of your genes from gene to CDS. Then plotting works.

Your gff files aren't wrong but their structure is a bit unexpected. The first one isn't really a gff file, but actually just a fasta file (it still works, though, to read it in as a gff file, just good to know).

The second gff file with the genes causes the problem. Each of your genes is defined by a single line with their type set to gene (3. column). This isn't wrong but unconventional for gff (at least from my experience).

In gffs you usually either have complex gene models comprising different hierarchical features each in their own line of different types including gene, transcript/mRNA, exon, and CDS (https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md). In such a complex model, the gene annotation is just a container, and the things you actually want to draw are the mRNA and CDS regions.

For simple single-exon prokaryotic genes, where gene, mRNA, exon and CDS are more or less the same you can have simplified gffs with a single line per gene, but then the convention is to at least keep the CDS feature, not the gene feature.

I implemented geom_gene() to understand those complex models as well as the simple CDS only model. It, therefore, doesn't really plot genes but actually the underlying features mRNA, exons and CDS. Since you don't have any of those, nothing is plotted. I have to admit, that isn't necessarily intuitive given the name. Have to think about it, thanks for making me aware!

library(gggenomes)
s0 <- read_seqs("3R_all_extraction.gff")
g0 <- read_feats("3R_all_extraction2.gff")
g0$type[g0$type == "gene"] <- "CDS"

gggenomes(g0, s0) + geom_seq() + geom_gene()

image

And just FYI, this is one way how you could add links between the sequences. Looks like you might need to reorder and rearrange some of the sequences. See pick and flip for that, and let me know if you have more questions on that end.

# This is actually a fasta file:
# mv 3R_all_extraction.gff 3R_all_extraction.fna
# Generate whole genome alignment
# minimap2 -X 3R_all_extraction.fna 3R_all_extraction.fna > 3R_all_extraction.paf
l0 <- read_links("3R_all_extraction.paf")
gggenomes(g0, s0, links=l0) +
  geom_link() +
  geom_seq() +
  geom_bin_label() +
  geom_gene()

image

tardl commented 2 years ago

Thanks a lot!