Open Ge0rges opened 7 months ago
Just a quick note as I'm passing through this: When we do this we need to think of a way that doesn't require a specific design that locks us in with Prodigal for gene calling. A generic design that can keep track of additional features for genes (or genomic regions, or nucleotides, or codons) that can also be populated from Prodigal output.
Because I understand this exists in the context of broader changes that need be done (that are beyond my current mastery of the Anvi'o codebase), here is a temporary pseudo-solution for anyone who ends up here.
I wrote this script which essentially piggy backs on Anvi'o prodigal caller and response parser:
import sys
sys.path.insert(1, '/path/to/anvio/anvio/drivers')
from prodigal import Prodigal
class ExtendedProdigal(Prodigal):
def __init__(self):
super().__init__()
self.available_parsers = {'v2.6.3': self.extended_parser,
'v2.6.2': self.extended_parser,
'v2.6.0': self.extended_parser}
super().check_version()
def extended_parser(self, defline):
"""parses this, but keep more information than the default anvio parser:
204_10M_MERGED.PERFECT.gz.keep_contig_1720_7 # 7086 # 7262 # 1 # ID=3_7;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.294
"""
hit = self._Prodigal__parser_1(defline)
fields = defline.split()
additional_attributes = fields[8].split(';')
hit['start_codon'] = additional_attributes[2].split('=')[1]
hit['rbs_motif'] = additional_attributes[3].split('=')[1]
hit['rbs_spacer'] = additional_attributes[4].split('=')[1]
hit['gc_cont'] = additional_attributes[5].split('=')[1]
return hit
prodigal = ExtendedProdigal()
prodigal.num_threads = int(sys.argv[3])
gene_calls, amino_acids = prodigal.process(sys.argv[1], sys.argv[2])
This won't update your contigs database or otherwise modify any Anvi'o functionality, however if you call it with a FASTA as input, it will return the same dict anvi'o generates in addition to keeping the other prodigal outputs chosen here (e.g. 'gc_cont'). Example run command python prodigal_caller.py mags/Pelagibacter_r-contigs.fna output_folder 40
where the format is python script_name.py input_fasta_path output_folder_for_anvio thread_number
.
@meren sidenote: while working on this I noticed an important typo here. I believe that line should read v2.6.0
. I didn't want to make a pull request just for that one typo though (but I can if you'd like!).
Thanks for letting me know about the typo, @Ge0rges. I'm not sure how did it survive this long. I guess because no one is using Prodigal v2.6.0
anymore.
Your temporary workaround is masterful and beautiful. Regarding the original feature request: this has been a difficult one to address because it requires a change in the way we keep gene calls in our relevant table with the addition of a few new columns, which will likely add millions of additional data points to that table, increasing the contigs-db size by a lot while only being relevant to a fraction of the users.
A better solution would be to extend that table if anvi-gen-contigs-db
receives a specific flag (i.e., --store-extended-gene-call-data
) to store larger amount of information regarding gene calls, rather than doing it every time anvi-gen-contigs-db
. But since this table will already go through a change soon with the eukaryotic gene calls, I've been sitting on my hands :)
That makes sense. I was wondering if the revamped mentioned in #2152 is thought of to affect the contigs-db or to involve the creation of new type of artifact centered around genomes/MAGs? If the latter was the case, this feature could be relegated to that artifact rather than expanding on contigs-db.
I think it will have to be new, optional tables in contigs-db. We already have the code to mark nucleotide, codon/amino acid positions in contig sequences in contigs-db files, but they are not used outside of anvi'o structure currently. We will have to make them more accessible to mainstream programs :)
The best way to get these things done is to have a project in the lab that needs this solution to be in place. That's why there is a delay currently :(
Hi @meren I've run into an issue with my workaround and wanted to see if you had any ideas. My goal is to get list of gene calls, and a list of functions, (for an anvio curated MAG, i.e. bin in my contigs DB/ a fasta file) that I can use separately and then downstream match the gene_callers_id
to execute other analyses.
The issue with my workaround is that it produces an export of gene calls with gene_callers_id
different to those that are already in the contigs db. But simultaneously, I need that extra metadata from the more extensive prodigal output parsing.
What would be the good way to go about this? I thought of modifying the function annotation commands to take in a fasta file as well but that seems pretty daunting and perhaps inefficient?
Dear @Ge0rges, I just sent a PR for anvio-dev (#2306), which replaces prodigal with pyrodigal. The PR also includes a new parameter for anvi-gen-contigs-database
, --full-gene-calling-report
, that I hope offers a partial solution to this issue. When an output file path is provided to --full-gene-calling-report
, anvi-gen-contigs-database
generates an additional TAB-delimited file that includes all data from the gene caller and looks like this:
gene_callers_id | contig | start | stop | direction | partial | partial_begin | partial_end | confidence | gc_cont | rbs_motif | rbs_spacer | score | cscore | rscore | sscore | start_type | translation_table | tscore | uscore | sequence | translated_sequence |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NC_009091 | 169 | 1327 | f | False | False | False | 99.99 | 0.3013 | 153.96808116167358 | 151.14841116167358 | -0.9874499999999999 | 2.8196700000000003 | ATG | 11 | 2.8971 | 0.9100200000000003 | ATGGAAATTATTTGTAATCAAAATGAATTA(...) | MEIICNQNELNNAIQLVSKAVASRPTHPIL(...) | ||
1 | NC_009091 | 1388 | 2036 | f | False | False | False | 99.99 | 0.2885 | GGA/GAG/AGG | 5-10bp | 81.87259573141999 | 71.58136573142 | 2.71875 | 10.291229999999999 | ATG | 11 | 2.8971 | 4.67538 | ATGGTCCTTAATTATGGAAATGGTGAAAAT(...) | MVLNYGNGENVWMHPPVHRILGWYSRPSNL(...) |
2 | NC_009091 | 2039 | 4379 | f | False | False | False | 99.99 | 0.3260 | GGA/GAG/AGG | 5-10bp | 352.2303246874159 | 347.0894946874159 | 2.71875 | 5.14083 | ATG | 11 | 2.8971 | -0.47502 | ATGATAAATCAAGAAAATAATGATCTATAT(...) | MINQENNDLYDLNEALQVENLTLNDYEEIC(...) |
3 | NC_009091 | 4426 | 5887 | f | False | False | False | 99.99 | 0.3347 | 194.19481790304437 | 188.61028790304437 | -0.9874499999999999 | 5.584530000000002 | ATG | 11 | 2.8971 | 3.6748800000000017 | ATGTGCGGAATAGTTGGAATCGTTTCTTCA(...) | MCGIVGIVSSDDVNQQIYDSLLLLQHRGQD(...) | ||
4 | NC_009091 | 5883 | 8325 | r | False | False | False | 99.99 | 0.2731 | 318.93104438253084 | 315.33533438253085 | -0.9874499999999999 | 3.5957100000000004 | ATG | 11 | 2.8971 | 1.6860600000000003 | ATGGATAAGAAAAACTTCACTTCTATCTCA(...) | MDKKNFTSISLQEEMQRSYLEYAMSVIVGR(...) | ||
5 | NC_009091 | 8402 | 9266 | r | False | False | False | 99.99 | 0.2719 | 99.52927145207937 | 93.12346145207937 | -0.9874499999999999 | 6.405809999999998 | ATG | 11 | 2.8971 | 4.496159999999998 | ATGAAAAAGTTTTTACAAAGAATACTCTGG(...) | MKKFLQRILWISLISFYFLQIKKVQAIVPY(...) | ||
6 | NC_009091 | 9262 | 10219 | r | False | False | False | 99.99 | 0.3239 | 107.44082411540528 | 104.45237411540528 | -0.9874499999999999 | 2.9884500000000003 | ATG | 11 | 2.8971 | 1.0788000000000002 | ATGATTAATAGGATTCAAGACAAAAAAGAA(...) | MINRIQDKKEISKKLKERAIFEGFTIAGIA(...) | ||
7 | NC_009091 | 10365 | 11100 | f | False | False | False | 99.99 | 0.3319 | 71.66956065705634 | 79.71184065705633 | -0.9874499999999999 | -8.042279999999998 | TTG | 11 | -9.60915 | 2.55432 | TTGGTTGAATCTAATCAAAATCAAGATTCC(...) | MVESNQNQDSNLGSRLQQDLKNDLIAGLLV(...) | ||
8 | NC_009091 | 11103 | 11721 | f | False | False | False | 99.99 | 0.2993 | 69.10757680331382 | 69.03014680331381 | -0.9874499999999999 | 0.07743000000000055 | ATG | 11 | 2.8971 | -1.8322199999999995 | ATGCATAATAGATCTCTTTCTAGAGAATTA(...) | MHNRSLSRELSLISLGLIKDKGDLKLNKFQ(...) | ||
(...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) |
The gene_callers_id
column here will obviously match to the gene calls stored in the contigs-db
file.
What do you think about this as a solution?
The need
Identifying things like RBS motif, start codon, etc. can come in handy with gene-aware analyses. Such analyses may become more frequent in the future especially given the current effort
2152
The solution
From @ivagljiva on discord:
Perhaps this should be relegated to the effort mentioned by @semiller10 in #2152, but I thought it pertinent to bring it up.
Beneficiaries
Anyone doing gene aware analyses.