merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
439 stars 145 forks source link

`external-gene-calls` from `prodigal v2.6.3` are not working when creating contig database #1165

Closed jolespin closed 5 years ago

jolespin commented 5 years ago

I've parsed the orf identifiers from prodigal v2.6.3.

Here's my anvi'o version:

anvi-profile --version
Anvi'o version ...............................: margaret (vunknown)
Profile DB version ...........................: 30
Contigs DB version ...........................: 12
Pan DB version ...............................: 12
Genome data storage version ..................: 6
Auxiliary data storage version ...............: 2
Structure DB version .........................: 1

Here's my output when trying to build contig database:

Input FASTA file .............................: bin_1.fa
Name .........................................: bin_1
Description ..................................: No description is given
Split Length .................................: 20,000
K-mer size ...................................: 4
Skip gene calling? ...........................: False
External gene calls provided? ................: bin_1.external-gene-calls.tsv
Ignoring internal stop codons? ...............: False
Splitting pays attention to gene calls? ......: True
External gene calls ..........................: 1141 gene calls recovered and will be processed.

Config Error: The sequence corresponds to the gene callers id '2' does not seem to have proper
              number of nucleotides to be translated :/ Here it is: ATGAAAGATATTTTTAAAAATTTGAT
              GAAAGTGCTAAAAAGTCCTATTTGACTTTTTTGAGTTGTTTTGTGAATAATAATTATCTCAATTCCTTCATTTTTGTTTA
              TAAGAAAGGAGCATTTTGTATTCAATGAAATAATGCATTTATTTTACTTAGAAGTTAGTTTAAGTATAATCATTGCAGTA
              CTTTTTTGAATATTTTTATGAGCAACTCTTTATAAAATGAAGTATTTTAGTACAGCAAAAAAAAGTGATAGTTCAGTTTG
              AGTGATTGCTTGATTTTTATGAGTTTTAGTTTCAGGTTGTCCTGCTTGTTCTATTTCTCTTGCTTCAATTTTTTGACTTG
              CAGGATTTATACAGTTTTTTCCTTATTGATGAATAGAATTAAAAATTATTTCAGTAGTTTTCCTTCTTTATGCTTGCTAT
              TCAACTTTGAAAAATCTAGAAGTTTGTAAACTAAAACCAAAAAAATCTTA

255

Here's my external-gene-calls:

head bin_1.external-gene-calls.tsv
gene_callers_id contig  start   stop    direction   partial source  version
1   NODE_3_length_131181_cov_12.369248  1   267 f   1   prodigal    v2.6.3
2   NODE_3_length_131181_cov_12.369248  182 658 r   0   prodigal    v2.6.3
3   NODE_3_length_131181_cov_12.369248  802 1635    r   0   prodigal    v2.6.3
4   NODE_3_length_131181_cov_12.369248  1761    2870    r   0   prodigal    v2.6.3
5   NODE_3_length_131181_cov_12.369248  2998    3468    f   0   prodigal    v2.6.3
6   NODE_3_length_131181_cov_12.369248  3495    3896    r   0   prodigal    v2.6.3
7   NODE_3_length_131181_cov_12.369248  4168    7257    f   0   prodigal    v2.6.3
8   NODE_3_length_131181_cov_12.369248  7452    8945    f   0   prodigal    v2.6.3
9   NODE_3_length_131181_cov_12.369248  9051    10151   f   0   prodigal    v2.6.3

Here's my prodigal orf identifiers:

cat bin_1.faa | grep "NODE_3_length_131181_cov_12.369248" | head
>NODE_3_length_131181_cov_12.369248_1 # 1 # 267 # 1 # ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.221
>NODE_3_length_131181_cov_12.369248_2 # 182 # 658 # -1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=10bp;gc_cont=0.249
>NODE_3_length_131181_cov_12.369248_3 # 802 # 1635 # -1 # ID=1_3;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=3bp;gc_cont=0.294
>NODE_3_length_131181_cov_12.369248_4 # 1761 # 2870 # -1 # ID=1_4;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=9bp;gc_cont=0.270
>NODE_3_length_131181_cov_12.369248_5 # 2998 # 3468 # 1 # ID=1_5;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=10bp;gc_cont=0.253
>NODE_3_length_131181_cov_12.369248_6 # 3495 # 3896 # -1 # ID=1_6;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=13bp;gc_cont=0.366
>NODE_3_length_131181_cov_12.369248_7 # 4168 # 7257 # 1 # ID=1_7;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=10bp;gc_cont=0.217
>NODE_3_length_131181_cov_12.369248_8 # 7452 # 8945 # 1 # ID=1_8;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=4bp;gc_cont=0.246
>NODE_3_length_131181_cov_12.369248_9 # 9051 # 10151 # 1 # ID=1_9;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=4bp;gc_cont=0.268
>NODE_3_length_131181_cov_12.369248_10 # 10327 # 11802 # -1 # ID=1_10;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=12bp;gc_cont=0.274

This is my function to parse the prodigal identifiers: (yes, I should probably use regex instead but this was a quick test)

    def _parse_prodigal(header):
        fields = header.split(" ")
        partial = fields[-1].split(";partial=")[1].split(";")[0]
        return {
            "gene_callers_id":fields[0],
            "contig":"_".join(fields[0].split("_")[:-1]),
            "start":int(fields[2]),
            "stop":int(fields[4]),
            "direction":{"1":"f", "-1":"r"}[fields[6]],
            "partial":{True:"0",False:"1"}[partial == "00"]
        }

Also, how do I need to use --ignore-internal-stop-codons if the genetic code I'm using has recoded stop codons?

meren commented 5 years ago

If you add -1 to your start values, things will likely work out just fine :)

Please see the anvi'o string indexing convention here:

http://merenlab.org/2016/06/22/anvio-tutorial-v2/#external-gene-calls

jolespin commented 5 years ago

Thanks, it looks like it's working now! I didn't realize it was python-style indexing but I should have known. Is it possible to give translated ORFs to ensure the correct codon table is being used in the translation or an option to specify which codon table is being used in this translation step?

meren commented 5 years ago

Thanks, it looks like it's working now!

Great!

Is it possible to give translated ORFs to ensure the correct codon table is being used in the translation or an option to specify which codon table is being used in this translation step?

This is something we thought about multiple times, however, we did not address it in the codebase. Perhaps a solution could be anvi-gen-contigs-database to accept an additional FASTA files of amino acids for each gene call described in the external gene calls file. But I am not sure when can we do it realistically as we are all swamped with time-sensitive coding tasks.

But you can always replace the amino acid sequences table in contigs database. If you do it in Python, we can add it to the repo as a script (i.e., anvi-script-update-amino-acid-seqeunces; takes a contigs db as an argument, and a FASTA file of amino acids where each defline uniquely corresponds to a gene call id). Just an idea.

Best,

jolespin commented 5 years ago

Yea, I can imagine there being a lot of moving pieces with a tool suite this expansive and such a large user-base.

If I have some extra time between projects I can help out to create a script that does this and/or a pull request with --external-protein-sequences on anvi-gen-contig-database.

Do you have any tutorials that show how to load the contig.db binary file into python (or is it just a pickled dict)? I can also go through the repo and find an example of where it is loaded and use that but thought I would ask first before I go down any rabbit holes.

Cheers

On May 21, 2019, at 7:15 AM, A. Murat Eren notifications@github.com wrote:

Closed #1165.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

meren commented 5 years ago

Hi Josh,

That would be excellent. If you were to be interested in writing a script, you could get some inspiration from previous scripts. Perhaps anvi-script-add-default-collection would be a good start.

In your script the main function would look something like this:

def main(args):
    # makes sure things check out
    utils.is_contigs_db(args.contigs_db)
    filesnpaths.is_file_fasta_formatted(args.fasta)

    # get a contigs db instance
    dbops.ContigsDatabase(args.contigs_db)

    # learn gene caller ids in the congits database:
    gene_calls_in_db = contigs_db.db.get_single_column_from_table('genes_in_contigs', 'gene_callers_id')

    # learn gene caller IDS in args.fasta
    (...)

    # make sure IDs compare well (i.e., there are no ids that are in FASTA but not in contigs db).
    (...)

    # construct your sql query and update aminio acid sequences in amino acid sequences table
    import anvio.tables as t
    table_name = t.gene_amino_acid_sequences_table_name
    contigs_db.db._exec_many(SQL_UPDATE_QUERY_GOES_HERE)
    # there are many examples of _exec_many operations in the codebase).

    # disconnect
    contigs_db.disconnect()

If you decide to take a stab at it and if you need more input please let me know.