[FEATURE REQUEST] improved docs on pangenome workflow that includes annotations

nick-youngblut commented 2 years ago

The need

To annotate pan-genome genes, the user is encouraged to export from an existing pan-genome database via anvi-get-sequences-for-gene-calls -g and then annotate the genes via their own methods (e.g., run eggnog-mapper themselves). However, the docs then say to use anvi-import-functions to get the annotations back into anvi'o, and this command can only import into a contigs database, not a genomes database, even though genes exported with anvi-get-sequences-for-gene-calls -g originate from the genomes database.

While I have contigs databases for each individual genome (as shown in the pangenomics tutorial), I do not have a single contigs database for all merged genomes. So, it's not clear, at least to me, how to either get the annotations into the genomes database, or create a merged contigs database for all genomes and then import via anvi-import-functions. It seems that anvi-merge just merges profiles.

The solution

More docs on a pan-genome functional annotation workflow would be helpful. A new anvi-merge-contigs command and/or an anvi-import-functions --genomes $GENOME_DB option could also be helpful.

If the only way is to create a contigs database from all merged genome FASTA files, then it would be helpful to include that info in the pangenomics tutorial. Currently, the tutorial encourages one contigs database file per genome:

name contigs_db_path
Name_01 /path/to/contigs-01.db
Name_02 /path/to/contigs-02.db
Name_03 /path/to/contigs-03.db
(…) (…)
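
For many genomes, a table like this is easier to generate than to type. A minimal sketch, assuming the tab-delimited, two-column layout shown above; the function name `write_external_genomes` and the paths in the usage comment are hypothetical:

```python
import csv
from pathlib import Path


def write_external_genomes(contigs_dbs, out_path):
    """Write a tab-delimited table with the columns name and contigs_db_path.

    contigs_dbs maps each genome name to the path of its contigs database.
    Rows are written in sorted name order so the output is reproducible.
    """
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["name", "contigs_db_path"])
        for name, db_path in sorted(contigs_dbs.items()):
            writer.writerow([name, str(db_path)])
    return Path(out_path)


# Hypothetical usage:
# write_external_genomes(
#     {"Name_01": "/path/to/contigs-01.db", "Name_02": "/path/to/contigs-02.db"},
#     "external-genomes.txt",
# )
```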

Also, from the anvi-gen-contigs-database help:

  -f FASTA, --contigs-fasta FASTA
                        The FASTA file that contains reference sequences you
                        mapped your samples against. This could be a reference
                        genome, or contigs from your assembler. Contig names
                        in this file must match to those in other input files.
                        If there is a problem anvi'o will gracefully complain
                        about it. (default: None)

...which encourages 1-genome => 1-contigs_db

Finally, the Prochlorococcus_31_genomes example includes 1 contig database per genome.

Beneficiaries

Users new to the pan-genome workflow

meren commented 2 years ago

To annotate pan-genome genes, the user is encouraged to export from an existing pan-genome database via anvi-get-sequences-for-gene-calls -g and then annotate the genes via their own methods

Hey @nick-youngblut, this should be a typo :( Can you please show us where you ran into this so we can fix it? Because the only way to annotate genes is in contigs-db files.

nick-youngblut commented 2 years ago

To be clear, I guess that I was led to use anvi-get-sequences-for-gene-calls -g because I ran through the pangenomics tutorial, which doesn't include any gene annotation, but later in the tutorial there is "Quantifying functional enrichment in a pangenome". So, suddenly functional annotations are required. If one has all of their genes in the genomes database, why not annotate all genes together in one go, instead of annotating per genome via:

for each genome:
- anvi-script-reformat-fasta
- anvi-gen-contigs-database
- anvi-get-sequences-for-gene-calls
- [somehow annotate; e.g., eggnog-mapper]
- [somehow format the annotations as needed for anvio import]
- anvi-import-functions
finally: anvi-gen-genomes-storage

The workflow that I list above is really the only way to add functional annotations for a pan-genome analysis, correct? I've already started writing all of the code required to run this mini-pipeline, so hopefully it's correct!
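
The per-genome steps listed above could be scripted roughly as follows. This is only a sketch: the helper names and file naming scheme are hypothetical, the flags reflect the programs' help output as I understand it and should be checked with `--help` for your anvi'o version, and the external annotation step (e.g., eggnog-mapper) is left out because it depends on the tool used.

```python
import subprocess
from pathlib import Path


def prepare_commands(name, fasta, workdir):
    """Commands to run for one genome BEFORE external annotation."""
    workdir = Path(workdir)
    clean_fasta = workdir / f"{name}.fa"
    contigs_db = workdir / f"{name}.db"
    genes_fasta = workdir / f"{name}-genes.faa"
    return [
        ["anvi-script-reformat-fasta", str(fasta),
         "-o", str(clean_fasta), "--simplify-names"],
        ["anvi-gen-contigs-database", "-f", str(clean_fasta),
         "-o", str(contigs_db), "-n", name],
        ["anvi-get-sequences-for-gene-calls", "-c", str(contigs_db),
         "--get-aa-sequences", "-o", str(genes_fasta)],
    ]


def import_command(name, workdir, functions_table):
    """Command to run for one genome AFTER the annotations are formatted."""
    contigs_db = Path(workdir) / f"{name}.db"
    return ["anvi-import-functions", "-c", str(contigs_db),
            "-i", str(functions_table)]


def genomes_storage_command(external_genomes, output):
    """Final step, run once after every contigs database is annotated."""
    return ["anvi-gen-genomes-storage", "-e", str(external_genomes),
            "-o", str(output)]


def run_all(commands, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on failure."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Keeping the pre-annotation and post-annotation commands in separate functions makes it explicit that the annotation tool runs between them.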

meren commented 2 years ago

The workflow that I list above is really the only way to add functional annotations for a pan-genome analysis, correct?

Yes, it is. We can improve the tutorial if this is not abundantly clear.

I've already started writing all of the code required to run this mini-pipeline, so hopefully it's correct!

Have you tried anvi'o snakemake workflows? We have one for pangenomics. It may be the easiest way to scale things up unless you have specific needs (such as eggnog-mapper?).

nick-youngblut commented 2 years ago

Have you tried anvi'o snakemake workflows?

Thanks for pointing out the workflows! I'm just expanding my existing snakemake workflow that can annotate, taxonomically classify, and map reads to genes. Would it make sense to use anvi-import-functions on per-gene taxonomic classifications, maybe to assess HGT?

meren commented 2 years ago

Would it make sense to use anvi-import-functions on per-gene taxonomic classifications, maybe to assess HGT?

Absolutely. You can use a unique name for your 'annotation source', and then you can pull off everything, or you can use anvi-display-functions for a given pangenome to visualize those genes that seem to originate from other taxa (although it may not be an impressive visualization for the genes that are in the right place, it may be useful to cluster genomes in a pangenome based on the distribution of HGTs).
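
As an illustration of the unique 'annotation source' idea, per-gene taxonomy calls could be written into the same kind of table that anvi-import-functions reads. A minimal sketch, assuming the tab-delimited functions format with the columns gene_callers_id, source, accession, function, and e_value; the source name `Gene_taxonomy`, the function name, and the placeholder e-value of 0 are assumptions made for the example:

```python
import csv

# Column order assumed for the table read by anvi-import-functions.
FUNCTION_COLUMNS = ["gene_callers_id", "source", "accession", "function", "e_value"]


def write_functions_table(rows, source, out_path):
    """Write annotations under a single, user-chosen annotation source.

    rows is an iterable of (gene_callers_id, accession, label, e_value).
    For taxonomy, label could be a lineage string; classifiers that report
    no e-value need a placeholder (0 is used in the usage example).
    """
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(FUNCTION_COLUMNS)
        for gene_id, accession, label, e_value in rows:
            writer.writerow([gene_id, source, accession, label, e_value])


# Hypothetical usage with made-up classifications:
# write_functions_table(
#     [(0, "taxid_1219", "Prochlorococcus marinus", 0)],
#     source="Gene_taxonomy",
#     out_path="Name_01-taxonomy.txt",
# )
```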

nick-youngblut commented 2 years ago

Great! I'll give it a go.

It would be great if anvi-import-functions could parse compressed annotation tables. If the user has hundreds or thousands of genomes, each with multiple annotations per gene, that could be a lot of (uncompressed) data.

meren commented 2 years ago

Well, even if annotations are compressed, SQLite has no compression option, so they will live their lives uncompressed. We have some re-design ideas to minimize the footprint of categorical data (functions, genes, contig names, etc) for storage efficiency, but currently a contigs-db is larger than a FASTA file when it shouldn't be really.

A workaround would be to uncompress them temporarily prior to import, but of course that creates a lot of I/O demand.
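
That workaround might look something like the sketch below, which assumes gzip-compressed tables. The helper name `decompressed` and the file names in the usage comment are hypothetical; the temporary copy is deleted as soon as the `with` block exits, so only one uncompressed table exists on disk at a time.

```python
import gzip
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def decompressed(gz_path):
    """Yield the path of a temporary, uncompressed copy of gz_path.

    The copy lives in a temporary directory that is removed on exit.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        # functions.txt.gz -> functions.txt inside the temporary directory
        plain = Path(tmpdir) / Path(gz_path).with_suffix("").name
        with gzip.open(gz_path, "rb") as src, open(plain, "wb") as dst:
            shutil.copyfileobj(src, dst)
        yield plain


# Hypothetical usage:
# import subprocess
# with decompressed("Name_01-functions.txt.gz") as table:
#     subprocess.run(
#         ["anvi-import-functions", "-c", "Name_01.db", "-i", str(table)],
#         check=True,
#     )
```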

We have an archive of a few hundred thousand genomes like that, and it is a major pain. When I have a week of tranquility I will fix it once and for all.

merenlab / anvio