Open nick-youngblut opened 2 years ago
To annotate pan-genome genes, the user is encouraged to export from an existing pan-genome database via anvi-get-sequences-for-gene-calls -g and then annotate the genes via their own methods
Hey @nick-youngblut, this should be a typo :( Can you please show us where you ran into this so we can fix it? Because the only way to annotate genes is in contigs-db files.
To be clear, I guess that I was led to use anvi-get-sequences-for-gene-calls -g, since I ran through the pangenomics tutorial, which doesn't include any gene annotation, but later in the tutorial there is Quantifying functional enrichment in a pangenome. So, suddenly functional annotations are required. If one has all of their genes in the genomes database, why not annotate all genes together in one go versus per-genome annotations via:
anvi-script-reformat-fasta
anvi-gen-contigs-database
anvi-get-sequences-for-gene-calls
anvi-import-functions
anvi-gen-genomes-storage
The workflow that I list above is really the only way to add functional annotations for a pan-genome analysis, correct? I've already started writing all of the code required to run this mini-pipeline, so hopefully it's correct!
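For reference, here is a minimal sketch of that per-genome mini-pipeline, written as Python helpers that only build the command lines (nothing is executed). The file names, the helper names, and the idea that the user's own annotation step produces a functions table are all assumptions for illustration. The flags shown are the ones I believe these programs accept, so please check each program's --help before relying on them.

```python
from pathlib import Path


def per_genome_commands(name, fasta, functions_txt, workdir="."):
    """Return the anvi'o commands for one genome as argv lists (nothing is run).

    All output file names below are hypothetical; pick whatever fits your layout.
    """
    work = Path(workdir)
    clean_fa = work / f"{name}.clean.fa"
    contigs_db = work / f"{name}.db"
    genes_faa = work / f"{name}.genes.faa"
    return [
        # 1. Simplify FASTA deflines so anvi'o accepts them.
        ["anvi-script-reformat-fasta", str(fasta), "-o", str(clean_fa), "--simplify-names"],
        # 2. One contigs database per genome.
        ["anvi-gen-contigs-database", "-f", str(clean_fa), "-o", str(contigs_db), "-n", name],
        # 3. Export the gene calls (amino acid sequences) for external annotation.
        ["anvi-get-sequences-for-gene-calls", "-c", str(contigs_db),
         "--get-aa-sequences", "-o", str(genes_faa)],
        # (Your own annotation step, e.g. eggnog-mapper, runs here and is
        #  assumed to produce `functions_txt`.)
        # 4. Import the annotations back into the same contigs database.
        ["anvi-import-functions", "-c", str(contigs_db), "-i", str(functions_txt)],
    ]


def external_genomes_table(genomes):
    """Build the external-genomes TSV text from a {name: contigs_db_path} dict."""
    lines = ["name\tcontigs_db_path"]
    lines += [f"{name}\t{path}" for name, path in genomes.items()]
    return "\n".join(lines) + "\n"


def genomes_storage_command(external_genomes_txt, out_db):
    """Command for the final step; as far as I know the output name should end in -GENOMES.db."""
    return ["anvi-gen-genomes-storage", "-e", str(external_genomes_txt), "-o", str(out_db)]
```

Each returned list can be handed to subprocess.run(cmd, check=True), or turned into a snakemake rule, once the paths match your setup.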
The workflow that I list above is really the only way to add functional annotations for a pan-genome analysis, correct?
Yes, it is. We can improve the tutorial if this is not abundantly clear.
I've already started writing all of the code required to run this mini-pipeline, so hopefully it's correct!
Have you tried anvi'o snakemake workflows? We have one for pangenomics. It may be the easiest way to scale things up unless you have specific needs (such as eggnog-mapper?).
Have you tried anvi'o snakemake workflows?
Thanks for pointing out the workflows! I'm just expanding my existing snakemake workflow that can annotate, taxonomically classify, and map reads to genes. Would it make sense to use anvi-import-functions on per-gene taxonomic classifications, maybe to assess HGT?
Would it make sense to use anvi-import-functions on per-gene taxonomic classifications, maybe to assess HGT?
Absolutely. You can use a unique name for your 'annotation source', and then you can pull off everything, or you can use anvi-display-functions for a given pangenome to visualize those genes that seem to originate from other taxa (although it may not be an impressive visualization for the genes that are in the right place, it may be useful to cluster genomes in a pangenome based on the distribution of HGTs).
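A rough sketch of what that could look like: per-gene taxonomic calls written out as a functions table under their own annotation source. The five-column layout is the simple tab-delimited format I understand anvi-import-functions to read. The source name "GeneTaxonomy", the helper name, and the placeholder e-value of 0 are assumptions made up for this example.

```python
import csv
import io

# Column layout of the simple functions table (tab-delimited).
FUNCTIONS_HEADER = ["gene_callers_id", "source", "accession", "function", "e_value"]


def taxonomy_as_functions(classifications, source="GeneTaxonomy"):
    """Format (gene_callers_id, taxon_id, taxon_name) tuples as a functions table.

    `source` is a hypothetical, unique annotation source name; the e_value
    column is filled with 0 as a placeholder because a taxonomic call has none.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(FUNCTIONS_HEADER)
    for gene_id, taxon_id, taxon_name in classifications:
        writer.writerow([gene_id, source, taxon_id, taxon_name, 0])
    return buf.getvalue()
```

Writing the returned text to a file and passing it to anvi-import-functions -c CONTIGS.db -i FILE should then make the taxonomy available under that source name, assuming the gene caller ids match those in the contigs database.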
Great! I'll give it a go.
It would be great if anvi-import-functions could parse compressed annotation tables. If the user has hundreds or thousands of genomes, each with multiple annotations per gene, that could be a lot of (uncompressed) data.
Well, even if annotations are compressed, SQLite has no compression option, so they will live their lives uncompressed. We have some re-design ideas to minimize the footprint of categorical data (functions, genes, contig names, etc) for storage efficiency, but currently a contigs-db is larger than a FASTA file when it shouldn't be really.
A workaround would be to uncompress them temporarily prior to import, but of course it is a lot of I/O demand.
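One way to sketch that workaround, assuming the tables are gzip-compressed: a small context manager that decompresses to a temporary file, hands back its path for the import, and removes it afterwards. The helper name is made up for this example, and the import call itself is only indicated in a comment.

```python
import gzip
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def temporarily_decompressed(gz_path):
    """Yield the path of a temporary uncompressed copy of `gz_path`.

    The temporary file is deleted when the `with` block exits, so the
    uncompressed table only exists for the duration of the import.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".txt")
    try:
        # Stream the data so large tables never have to fit in memory.
        with os.fdopen(fd, "wb") as dst, gzip.open(gz_path, "rb") as src:
            shutil.copyfileobj(src, dst)
        yield tmp_path
    finally:
        os.remove(tmp_path)


# Usage idea (not executed here):
# with temporarily_decompressed("G1.functions.txt.gz") as table:
#     subprocess.run(["anvi-import-functions", "-c", "G1.db", "-i", table], check=True)
```

This keeps the archive compressed on disk, at the cost of the extra I/O for every import.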
We have an archive of a few 100 thousand genomes like that, and it is a major pain. When I have a week of tranquility I will fix it once and for all.
The need
To annotate pan-genome genes, the user is encouraged to export from an existing pan-genome database via anvi-get-sequences-for-gene-calls -g and then annotate the genes via their own methods (e.g., run eggnog-mapper themselves). However, after annotating the genes, the docs state to use anvi-import-functions to get the annotations back into anvi'o, but this command can only import into a contigs database, and not a genomes database, even though the genes originated from the genomes database if one uses anvi-get-sequences-for-gene-calls -g.
While I have contigs databases for each individual genome (as shown in the pangenomics tutorial), I do not have 1 contigs database for all merged genomes. So, it's not clear, at least to me, how to either get the annotations in the genomes database, or create a merged contigs database for all genomes and then import via anvi-import-functions. It seems that anvi-merge just merges profiles.
The solution
More docs on a pan-genome functional annotation workflow would be helpful.
anvi-merge-contigs and/or anvi-import-functions --genomes $GENOME_DB could also be helpful.
If the only way is to create a contigs database from all merged genome fasta files, then it would be helpful to include that info in the pangenomics tutorial. Currently, the tutorial encourages individual contig database files:
Also from the anvi-gen-contigs-database: ...which encourages 1-genome => 1-contigs_db
Finally, the Prochlorococcus_31_genomes example includes 1 contig database per genome.
Beneficiaries
Users new to the pan-genome workflow