merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
415 stars 142 forks source link

[FEATURE REQUEST] Preserve prodigal metadata for `anvi-export-gene-calls` #2181

Open Ge0rges opened 7 months ago

Ge0rges commented 7 months ago

The need

Identifying things like RBS motif, start codon, etc. can come in handy with gene-aware analyses. Such analyses may become more frequent in the future especially given the current effort

to soon implement a general framework for storing the coordinates of any type of genomic feature

2152

The solution

From @ivagljiva on discord:

to modify what we store in the [contigs] DB, and then add a flag to anvi-export-gene-calls (ie, in the genes_in_contigs table) to include that information in the output.

Perhaps this should be relegated to the effort mentioned by @semiller10 in #2152, but I thought it pertinent to bring it up.

Beneficiaries

Anyone doing gene aware analyses.

meren commented 7 months ago

Just a quick note as I'm passing through this: When we do this we need to think of a way that doesn't require a specific design that locks us in with Prodigal for gene calling. A generic design that can keep track of additional features for genes (or genomic regions, or nucleotides, or codons) that can also be populated from Prodigal output.

Ge0rges commented 4 months ago

Because I understand this exists in the context of broader changes that need be done (that are beyond my current mastery of the Anvi'o codebase), here is a temporary pseudo-solution for anyone who ends up here.

I wrote this script which essentially piggy backs on Anvi'o prodigal caller and response parser:

import sys
sys.path.insert(1, '/path/to/anvio/anvio/drivers')
from prodigal import Prodigal

class ExtendedProdigal(Prodigal):
    def __init__(self):
        super().__init__()

        self.available_parsers = {'v2.6.3': self.extended_parser,
                                  'v2.6.2': self.extended_parser,
                                  'v2.6.0': self.extended_parser}

        super().check_version()

    def extended_parser(self, defline):
        """parses this, but keep more information than the default anvio parser:

            204_10M_MERGED.PERFECT.gz.keep_contig_1720_7 # 7086 # 7262 # 1 # ID=3_7;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.294

        """

        hit = self._Prodigal__parser_1(defline)

        fields = defline.split()

        additional_attributes = fields[8].split(';')

        hit['start_codon'] = additional_attributes[2].split('=')[1]
        hit['rbs_motif'] = additional_attributes[3].split('=')[1]
        hit['rbs_spacer'] = additional_attributes[4].split('=')[1]
        hit['gc_cont'] = additional_attributes[5].split('=')[1]

        return hit

prodigal = ExtendedProdigal()
prodigal.num_threads = int(sys.argv[3])
gene_calls, amino_acids = prodigal.process(sys.argv[1], sys.argv[2])

This won't update your contigs database or otherwise modify any Anvi'o functionality, however if you call it with a FASTA as input, it will return the same dict anvi'o generates in addition to keeping the other prodigal outputs chosen here (e.g. 'gc_cont'). Example run command python prodigal_caller.py mags/Pelagibacter_r-contigs.fna output_folder 40 where the format is python script_name.py input_fasta_path output_folder_for_anvio thread_number.

Ge0rges commented 4 months ago

@meren sidenote: while working on this I noticed an important typo here. I believe that line should read v2.6.0. I didn't want to make a pull request just for that one typo though (but I can if you'd like!).

meren commented 4 months ago

Thanks for letting me know about the typo, @Ge0rges. I'm not sure how did it survive this long. I guess because no one is using Prodigal v2.6.0 anymore.

Your temporary workaround is masterful and beautiful. Regarding the original feature request: this has been a difficult one to address because it requires a change in the way we keep gene calls in our relevant table with the addition of a few new columns, which will likely add millions of additional data points to that table, increasing the contigs-db size by a lot while only being relevant to a fraction of the users.

A better solution would be to extend that table if anvi-gen-contigs-db receives a specific flag (i.e., --store-extended-gene-call-data) to store larger amount of information regarding gene calls, rather than doing it every time anvi-gen-contigs-db. But since this table will already go through a change soon with the eukaryotic gene calls, I've been sitting on my hands :)

Ge0rges commented 4 months ago

That makes sense. I was wondering if the revamped mentioned in #2152 is thought of to affect the contigs-db or to involve the creation of new type of artifact centered around genomes/MAGs? If the latter was the case, this feature could be relegated to that artifact rather than expanding on contigs-db.

meren commented 4 months ago

I think it will have to be new, optional tables in contigs-db. We already have the code to mark nucleotide, codon/amino acid positions in contig sequences in contigs-db files, but they are not used outside of anvi'o structure currently. We will have to make them more accessible to mainstream programs :)

The best way to get these things done is to have a project in the lab that needs this solution to be in place. That's why there is a delay currently :(

Ge0rges commented 2 weeks ago

Hi @meren I've run into an issue with my workaround and wanted to see if you had any ideas. My goal is to get list of gene calls, and a list of functions, (for an anvio curated MAG, i.e. bin in my contigs DB/ a fasta file) that I can use separately and then downstream match the gene_callers_id to execute other analyses.

The issue with my workaround is that it produces an export of gene calls with gene_callers_id different to those that are already in the contigs db. But simultaneously, I need that extra metadata from the more extensive prodigal output parsing.

What would be the good way to go about this? I thought of modifying the function annotation commands to take in a fasta file as well but that seems pretty daunting and perhaps inefficient?

meren commented 4 days ago

Dear @Ge0rges, I just sent a PR for anvio-dev (#2306), which replaces prodigal with pyrodigal. The PR also includes a new parameter for anvi-gen-contigs-database, --full-gene-calling-report, that I hope offers a partial solution to this issue. When an output file path is provided to --full-gene-calling-report, anvi-gen-contigs-database generates an additional TAB-delimited file that includes all data from the gene caller and looks like this:

gene_callers_id contig start stop direction partial partial_begin partial_end confidence gc_cont rbs_motif rbs_spacer score cscore rscore sscore start_type translation_table tscore uscore sequence translated_sequence
0 NC_009091 169 1327 f False False False 99.99 0.3013 153.96808116167358 151.14841116167358 -0.9874499999999999 2.8196700000000003 ATG 11 2.8971 0.9100200000000003 ATGGAAATTATTTGTAATCAAAATGAATTA(...) MEIICNQNELNNAIQLVSKAVASRPTHPIL(...)
1 NC_009091 1388 2036 f False False False 99.99 0.2885 GGA/GAG/AGG 5-10bp 81.87259573141999 71.58136573142 2.71875 10.291229999999999 ATG 11 2.8971 4.67538 ATGGTCCTTAATTATGGAAATGGTGAAAAT(...) MVLNYGNGENVWMHPPVHRILGWYSRPSNL(...)
2 NC_009091 2039 4379 f False False False 99.99 0.3260 GGA/GAG/AGG 5-10bp 352.2303246874159 347.0894946874159 2.71875 5.14083 ATG 11 2.8971 -0.47502 ATGATAAATCAAGAAAATAATGATCTATAT(...) MINQENNDLYDLNEALQVENLTLNDYEEIC(...)
3 NC_009091 4426 5887 f False False False 99.99 0.3347 194.19481790304437 188.61028790304437 -0.9874499999999999 5.584530000000002 ATG 11 2.8971 3.6748800000000017 ATGTGCGGAATAGTTGGAATCGTTTCTTCA(...) MCGIVGIVSSDDVNQQIYDSLLLLQHRGQD(...)
4 NC_009091 5883 8325 r False False False 99.99 0.2731 318.93104438253084 315.33533438253085 -0.9874499999999999 3.5957100000000004 ATG 11 2.8971 1.6860600000000003 ATGGATAAGAAAAACTTCACTTCTATCTCA(...) MDKKNFTSISLQEEMQRSYLEYAMSVIVGR(...)
5 NC_009091 8402 9266 r False False False 99.99 0.2719 99.52927145207937 93.12346145207937 -0.9874499999999999 6.405809999999998 ATG 11 2.8971 4.496159999999998 ATGAAAAAGTTTTTACAAAGAATACTCTGG(...) MKKFLQRILWISLISFYFLQIKKVQAIVPY(...)
6 NC_009091 9262 10219 r False False False 99.99 0.3239 107.44082411540528 104.45237411540528 -0.9874499999999999 2.9884500000000003 ATG 11 2.8971 1.0788000000000002 ATGATTAATAGGATTCAAGACAAAAAAGAA(...) MINRIQDKKEISKKLKERAIFEGFTIAGIA(...)
7 NC_009091 10365 11100 f False False False 99.99 0.3319 71.66956065705634 79.71184065705633 -0.9874499999999999 -8.042279999999998 TTG 11 -9.60915 2.55432 TTGGTTGAATCTAATCAAAATCAAGATTCC(...) MVESNQNQDSNLGSRLQQDLKNDLIAGLLV(...)
8 NC_009091 11103 11721 f False False False 99.99 0.2993 69.10757680331382 69.03014680331381 -0.9874499999999999 0.07743000000000055 ATG 11 2.8971 -1.8322199999999995 ATGCATAATAGATCTCTTTCTAGAGAATTA(...) MHNRSLSRELSLISLGLIKDKGDLKLNKFQ(...)
(...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...)

The gene_callers_id column here will obviously match to the gene calls stored in the contigs-db file.

What do you think about this as a solution?