nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
321 stars 85 forks source link

transcript writing failing after re-annotation a genbank file #190

Closed hyphaltip closed 6 years ago

hyphaltip commented 6 years ago

Processing a funannotate annotate on an existing genbank genome

https://www.ncbi.nlm.nih.gov/genome/?term=txid245562[Organism:exp]

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/572/075/GCA_001572075.1_ASM157207v1/GCA_001572075.1_ASM157207v1_genomic.gbff.gz

I get this when the annotate it writing back out the files in the annotate_results folder. I'm working on it but wanted to record this as a bug and then when we get a fix in can refer to this.

  File "/bigdata/operations/pkgadmin/opt/linux/centos/7.x/x86_64/pkgs/funannotate/git-live/bin/funannotate-functional.py", line 1040, in <module>
    lib.gb2allout(final_gbk, final_gff, final_proteins, final_transcripts, final_fasta)
  File "/bigdata/operations/pkgadmin/opt/linux/centos/7.x/x86_64/pkgs/funannotate/git-live/lib/library.py", line 3056, in gb2allout
    tranout.write('>%s %s\n%s\n' % (v['ids'][i], k, v['transcript'][i]))
IndexError: list index out of range
hyphaltip commented 6 years ago

Here is debug of the object which is causing fail

transcript for 1 in [Seq('GTCATCCTCATTACTTGCCATCACATACGGGATCTCGGCGAGGTCGGTCGAACA...TGA', IUPACAmbiguousDNA())]
{'product': ['glycosyltransferase family 32 protein', 'hypothetical protein'], 'name': None, 'db_xref': [[], []], 'go_terms': [[], []], 'transcript': [Seq('GTCATCCTCATTACTTGCCATCACATACGGGATCTCGGCGAGGTCGGTCGAACA...TGA', IUPACAmbiguousDNA())], 'codon_start': [1, 1], 'ids': ['FE78DRAFT_540541-T1', 'FE78DRAFT_91119-T1'], 'note': [[], []], 'source': 'GenBank', 'cds_transcript': [Seq('GTCATCCTCATTACTTGCCATCACATACGGGATCTCGGCGAGGTCGGTCGAACA...TGA', IUPACAmbiguousDNA()), Seq('ATGTTGCGCGGGAAGGTTGTGACTCTCCTATTGGCTCATGGAGGTGCAAGCAGC...TGA', IUPACAmbiguousDNA())], 'location': (67776, 68784), 'mRNA': [[(67776, 68784)]], 'CDS': [[(67776, 68438)], [(68515, 68784)]], 'partialStop': [False], 'protein': ['VILITCHHIRDLGEVGRTYLTFNRLIKHHQDVLFQYRPSREGKSLNSEQQCRVPKIIHQVFLADGRNSSLDKYEDALASCHRLHPSWEYRLWTDDNSRSFVRDQYANILPHYKNYRQSIQRANILRYLLLHHYGGVYLDLDITCLTPLDELLHLSWLTPGAYPAGVNNAYVLARKRHTFLVKLLQNVPSRDMRWPMPYVENMLTTGCMFFANRWMRYAWT', 'MLRGKVVTLLLAHGGASSWHEWDDAIILMIGKQYGYFSVIVTVIVAMVIIALRRLLEGRSICRGRRLRSWRGTQEDPNDEERLLINKQG'], 'type': 'mRNA', 'contig': 'JOOL01001072.1', 'strand': '+', 'partialStart': [True]}
id for 1 in ['FE78DRAFT_540541-T1', 'FE78DRAFT_91119-T1'] k={'product': ['glycosyltransferase family 32 protein', 'hypothetical protein'], 'name': None, 'db_xref': [[], []], 'go_terms': [[], []], 'transcript': [Seq('GTCATCCTCATTACTTGCCATCACATACGGGATCTCGGCGAGGTCGGTCGAACA...TGA', IUPACAmbiguousDNA())], 'codon_start': [1, 1], 'ids': ['FE78DRAFT_540541-T1', 'FE78DRAFT_91119-T1'], 'note': [[], []], 'source': 'GenBank', 'cds_transcript': [Seq('GTCATCCTCATTACTTGCCATCACATACGGGATCTCGGCGAGGTCGGTCGAACA...TGA', IUPACAmbiguousDNA()), Seq('ATGTTGCGCGGGAAGGTTGTGACTCTCCTATTGGCTCATGGAGGTGCAAGCAGC...TGA', IUPACAmbiguousDNA())], 'location': (67776, 68784), 'mRNA': [[(67776, 68784)]], 'CDS': [[(67776, 68438)], [(68515, 68784)]], 'partialStop': [False], 'protein': ['VILITCHHIRDLGEVGRTYLTFNRLIKHHQDVLFQYRPSREGKSLNSEQQCRVPKIIHQVFLADGRNSSLDKYEDALASCHRLHPSWEYRLWTDDNSRSFVRDQYANILPHYKNYRQSIQRANILRYLLLHHYGGVYLDLDITCLTPLDELLHLSWLTPGAYPAGVNNAYVLARKRHTFLVKLLQNVPSRDMRWPMPYVENMLTTGCMFFANRWMRYAWT', 'MLRGKVVTLLLAHGGASSWHEWDDAIILMIGKQYGYFSVIVTVIVAMVIIALRRLLEGRSICRGRRLRSWRGTQEDPNDEERLLINKQG'], 'type': 'mRNA', 'contig': 'JOOL01001072.1', 'strand': '+', 'partialStart': [True]}
hyphaltip commented 6 years ago

Okay the problem is due to overlapping genes inferred FE78DRAFT_540541-T1 and FE78DRAFT_91119-T1 are getting merged in some way because their locations overlap... I thought you could assume every gene/CDS/mRNA has a locus_tag which could be used to group genes instead of locations?

     gene            68080..69101
                     /locus_tag="FE78DRAFT_91119"
     mRNA            join(68080..68990,69048..69101)
                     /locus_tag="FE78DRAFT_91119"
                     /product="hypothetical protein"
     CDS             68515..68784
                     /locus_tag="FE78DRAFT_91119"
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="KXL44899.1"
                     /db_xref="JGIDB:Aciri1_meta_91119"
                     /translation="MLRGKVVTLLLAHGGASSWHEWDDAIILMIGKQYGYFSVIVTVI
                     VAMVIIALRRLLEGRSICRGRRLRSWRGTQEDPNDEERLLINKQG"
    gene            <67776..68784
                     /locus_tag="FE78DRAFT_540541"
     mRNA            <67776..68784
                     /locus_tag="FE78DRAFT_540541"
                     /product="glycosyltransferase family 32 protein"
     CDS             <67776..68438
                     /locus_tag="FE78DRAFT_540541"
nextgenusfs commented 6 years ago

I think this one needs to be updated as well to use the unified parser. It should only be merging the transcripts of the locus_id is the same.

nextgenusfs commented 6 years ago

Looks like this function is already updated with latest parser, this shouldn't be causing error. So I downloaded and tried locally and seems to not error on my system:

$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/572/075/GCA_001572075.1_ASM157207v1/GCA_001572075.1_ASM157207v1_genomic.gbff.gz

$ gunzip  GCA_001572075.1_ASM157207v1_genomic.gbff.gz

$ funannotate annotate --genbank GCA_001572075.1_ASM157207v1_genomic.gbff -o test 
-------------------------------------------------------
[Jul 03 04:51 PM]: OS: MacOSX 10.13.5, 8 cores, ~ 17 GB RAM. Python: 2.7.11
[Jul 03 04:51 PM]: Running funannotate v1.4.0-90e1680
[Jul 03 04:51 PM]: No NCBI SBT file given, will use default, however if you plan to submit to NCBI, create one and pass it here '--sbt'
[Jul 03 04:51 PM]: Output directory test already exists, will use any existing data.  If this is not what you want, exit, and provide a unique name for output folder
[Jul 03 04:51 PM]: Checking GenBank file for annotation
[Jul 03 04:51 PM]: Adding Functional Annotation to Acidomyces richmondensis, NCBI accession: None
[Jul 03 04:51 PM]: Annotation consists of: 10,338 gene models
[Jul 03 04:51 PM]: 10,338 protein records loaded
[Jul 03 04:51 PM]: Running HMMer search of PFAM version 31.0
...
nextgenusfs commented 6 years ago

Or wait, are you getting an error on the output after the annotation?

hyphaltip commented 6 years ago

Yes this is on the output part.

On Jul 3, 2018, 1:56 PM -0700, Jon Palmer notifications@github.com, wrote:

Or wait, are you getting an error on the output after the annotation? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

nextgenusfs commented 6 years ago

Okay... good news is found the bug. Basically when processing NCBI GBK files the scripts find a protein_id -- which from NCBI looks something like KXL41355.1 for example - as it represents an ID in their protein database. When you are annotating from "scratch" the funannotated genbank files don't have these protein_ids yet, so it defaults to locus_tag-T1 (example). So what is happening is that the protein FASTA file is being generated and using the protein_id field for the unique name, which then doesn't match up with what is written in the tbl file. So need to get this straightened out -- this could end up being somewhat difficult as I've got several checks built in based off the locus_tag-T1 format, so those need to be adjusted accordingly.

hyphaltip commented 6 years ago

Sure - FYI. My use case is using your comparative pipeline on a mix of funannotate and genbank genomes

Alternatively I was thinking of finishing my separate compare pipeline which would fill in annotations in a format that doesnt depend on genbank and the annotations decorated on that. Could run/rerun iprscan etc so could run from jgi / ensembl CDS and pepfiles (or DNA + gff)

Jason Stajich, PhD jasonstajich.phd@gmail.com On Jul 4, 2018, 3:43 PM -0700, Jon Palmer notifications@github.com, wrote:

Okay... good news is found the bug. Basically when processing NCBI GBK files the scripts find a protein_id -- which from NCBI looks something like KXL41355.1 for example - as it represents an ID in their protein database. When you are annotating from "scratch" the funannotated genbank files don't have these protein_ids yet, so it defaults to locus_tag-T1 (example). So what is happening is that the protein FASTA file is being generated and using the protein_id field for the unique name, which then doesn't match up with what is written in the tbl file. So need to get this straightened out -- this could end up being somewhat difficult as I've got several checks built in based off the locus_tag-T1 format, so those need to be adjusted accordingly. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

nextgenusfs commented 6 years ago

Yeah, that's what I figured - but it should work from genbank file as well. Okay, well what I thought bug was wasn't (although it also needed to be fixed).

$ funannotate annotate --genbank GCA_001572075.1_ASM157207v1_genomic.gbff -o test --cpus 20
-------------------------------------------------------
[Jul 04 04:06 PM]: OS: MacOSX 10.13.5, 24 cores, ~ 67 GB RAM. Python: 2.7.14
[Jul 04 04:06 PM]: Running funannotate v1.4.0-90e1680
[Jul 04 04:06 PM]: No NCBI SBT file given, will use default, however if you plan to submit to NCBI, create one and pass it here '--sbt'
[Jul 04 04:06 PM]: Output directory test already exists, will use any existing data.  If this is not what you want, exit, and provide a unique name for output folder
[Jul 04 04:06 PM]: Checking GenBank file for annotation
[Jul 04 04:07 PM]: Adding Functional Annotation to Acidomyces richmondensis, NCBI accession: None
[Jul 04 04:07 PM]: Annotation consists of: 10,338 gene models
[Jul 04 04:07 PM]: 10,338 protein records loaded
[Jul 04 04:07 PM]: Running HMMer search of PFAM version 31.0
[Jul 04 04:10 PM]: 10,839 annotations added
[Jul 04 04:10 PM]: Running Diamond blastp search of UniProt DB version 2018_06
[Jul 04 04:11 PM]: 608 valid gene/product annotations from 867 total
[Jul 04 04:11 PM]: Running Eggnog-mapper
[Jul 04 04:28 PM]: Parsing EggNog Annotations
[Jul 04 04:28 PM]: 16,052 COG and EggNog annotations added
[Jul 04 04:28 PM]: Combining UniProt/EggNog gene and product names using Gene2Product version 1.10
[Jul 04 04:28 PM]: 1,943 gene name and product description annotations added
[Jul 04 04:28 PM]: Running Diamond blastp search of MEROPS version 12.0
[Jul 04 04:29 PM]: 292 annotations added
[Jul 04 04:29 PM]: Annotating CAZYmes using HMMer search of dbCAN version 6.0
[Jul 04 04:29 PM]: 570 annotations added
[Jul 04 04:29 PM]: Annotating proteins with BUSCO dikarya models
[Jul 04 04:30 PM]: 1,261 annotations added
[Jul 04 04:30 PM]: Skipping phobius predictions, try funannotate remote -m phobius
[Jul 04 04:30 PM]: Predicting secreted proteins with SignalP
[Jul 04 04:37 PM]: 635 secretome and 0 transmembane annotations added
[Jul 04 04:37 PM]: InterProScan error, test/annotate_misc/iprscan.xml is empty, or no XML file passed via --iprscan. Functional annotation will be lacking.
[Jul 04 04:37 PM]: Found 33,535 duplicated annotations, adding 0 valid annotations
[Jul 04 04:37 PM]: Converting to final Genbank format, good luck!
Traceback (most recent call last):
  File "/usr/local/funannotate/bin/funannotate-functional.py", line 1040, in <module>
    lib.gb2allout(final_gbk, final_gff, final_proteins, final_transcripts, final_fasta)
  File "/usr/local/funannotate/lib/library.py", line 3057, in gb2allout
    #now write mRNA feature
IndexError: list index out of range

I'm still getting the error, which doesn't make any sense, here is the tbl annotation file going into tbl2asn:

<67776  68784   gene
            locus_tag   FE78DRAFT_540541
<67776  68784   mRNA
            product glycosyltransferase family 32 protein
            transcript_id   gnl|ncbi|FE78DRAFT_540541-T1_mrna
            protein_id  gnl|ncbi|FE78DRAFT_540541-T1
<67776  68438   CDS
            codon_start 1
            product glycosyltransferase family 32 protein
            transcript_id   gnl|ncbi|FE78DRAFT_540541-T1_mrna
            protein_id  gnl|ncbi|FE78DRAFT_540541-T1
68080   69101   gene
            locus_tag   FE78DRAFT_91119
68080   68990   mRNA
69048   69101
            product hypothetical protein
            transcript_id   gnl|ncbi|FE78DRAFT_91119-T1_mrna
            protein_id  gnl|ncbi|FE78DRAFT_91119-T1
68515   68784   CDS
            codon_start 1
            product hypothetical protein
            transcript_id   gnl|ncbi|FE78DRAFT_91119-T1_mrna
            protein_id  gnl|ncbi|FE78DRAFT_91119-T1

Yet somehow this is what is being written in the GenBank file --> note the locus_tag in the second gene. So the parser sees the locus_tag and puts that CDS into the other gene location and then hence the error...

     gene            <67776..68784
                     /locus_tag="FE78DRAFT_540541"
     mRNA            <67776..68784
                     /locus_tag="FE78DRAFT_540541"
                     /product="glycosyltransferase family 32 protein"
     CDS             <67776..68438
                     /locus_tag="FE78DRAFT_540541"
                     /codon_start=1
                     /product="glycosyltransferase family 32 protein"
                     /protein_id="ncbi:FE78DRAFT_540541-T1"
                     /translation="VILITCHHIRDLGEVGRTYLTFNRLIKHHQDVLFQYRPSREGKS
                     LNSEQQCRVPKIIHQVFLADGRNSSLDKYEDALASCHRLHPSWEYRLWTDDNSRSFVR
                     DQYANILPHYKNYRQSIQRANILRYLLLHHYGGVYLDLDITCLTPLDELLHLSWLTPG
                     AYPAGVNNAYVLARKRHTFLVKLLQNVPSRDMRWPMPYVENMLTTGCMFFANRWMRYA
                     WT"
     gene            68080..69101
                     /locus_tag="FE78DRAFT_91119"
     mRNA            join(68080..68990,69048..69101)
                     /locus_tag="FE78DRAFT_91119"
                     /product="hypothetical protein"
     CDS             68515..68784
                     /locus_tag="FE78DRAFT_540541"
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="ncbi:FE78DRAFT_91119-T1"
                     /translation="MLRGKVVTLLLAHGGASSWHEWDDAIILMIGKQYGYFSVIVTVI
                     VAMVIIALRRLLEGRSICRGRRLRSWRGTQEDPNDEERLLINKQG"

The other thing very strange about this annotation is that many/most of these gene models should have failed -- they are listed as partial when they are not, etc.

It almost seems like it is a tbl2asn error - but I've never seen this before. Let me pull a few other NCBI GBK genomes and see if this is repeatable on other genomes. In the meantime, I'll push the fix for the annotations not being added correctly (which is what I thought the error was).