Closed hyphaltip closed 6 years ago
Here is debug of the object which is causing fail
transcript for 1 in [Seq('GTCATCCTCATTACTTGCCATCACATACGGGATCTCGGCGAGGTCGGTCGAACA...TGA', IUPACAmbiguousDNA())]
{'product': ['glycosyltransferase family 32 protein', 'hypothetical protein'], 'name': None, 'db_xref': [[], []], 'go_terms': [[], []], 'transcript': [Seq('GTCATCCTCATTACTTGCCATCACATACGGGATCTCGGCGAGGTCGGTCGAACA...TGA', IUPACAmbiguousDNA())], 'codon_start': [1, 1], 'ids': ['FE78DRAFT_540541-T1', 'FE78DRAFT_91119-T1'], 'note': [[], []], 'source': 'GenBank', 'cds_transcript': [Seq('GTCATCCTCATTACTTGCCATCACATACGGGATCTCGGCGAGGTCGGTCGAACA...TGA', IUPACAmbiguousDNA()), Seq('ATGTTGCGCGGGAAGGTTGTGACTCTCCTATTGGCTCATGGAGGTGCAAGCAGC...TGA', IUPACAmbiguousDNA())], 'location': (67776, 68784), 'mRNA': [[(67776, 68784)]], 'CDS': [[(67776, 68438)], [(68515, 68784)]], 'partialStop': [False], 'protein': ['VILITCHHIRDLGEVGRTYLTFNRLIKHHQDVLFQYRPSREGKSLNSEQQCRVPKIIHQVFLADGRNSSLDKYEDALASCHRLHPSWEYRLWTDDNSRSFVRDQYANILPHYKNYRQSIQRANILRYLLLHHYGGVYLDLDITCLTPLDELLHLSWLTPGAYPAGVNNAYVLARKRHTFLVKLLQNVPSRDMRWPMPYVENMLTTGCMFFANRWMRYAWT', 'MLRGKVVTLLLAHGGASSWHEWDDAIILMIGKQYGYFSVIVTVIVAMVIIALRRLLEGRSICRGRRLRSWRGTQEDPNDEERLLINKQG'], 'type': 'mRNA', 'contig': 'JOOL01001072.1', 'strand': '+', 'partialStart': [True]}
id for 1 in ['FE78DRAFT_540541-T1', 'FE78DRAFT_91119-T1'] k={'product': ['glycosyltransferase family 32 protein', 'hypothetical protein'], 'name': None, 'db_xref': [[], []], 'go_terms': [[], []], 'transcript': [Seq('GTCATCCTCATTACTTGCCATCACATACGGGATCTCGGCGAGGTCGGTCGAACA...TGA', IUPACAmbiguousDNA())], 'codon_start': [1, 1], 'ids': ['FE78DRAFT_540541-T1', 'FE78DRAFT_91119-T1'], 'note': [[], []], 'source': 'GenBank', 'cds_transcript': [Seq('GTCATCCTCATTACTTGCCATCACATACGGGATCTCGGCGAGGTCGGTCGAACA...TGA', IUPACAmbiguousDNA()), Seq('ATGTTGCGCGGGAAGGTTGTGACTCTCCTATTGGCTCATGGAGGTGCAAGCAGC...TGA', IUPACAmbiguousDNA())], 'location': (67776, 68784), 'mRNA': [[(67776, 68784)]], 'CDS': [[(67776, 68438)], [(68515, 68784)]], 'partialStop': [False], 'protein': ['VILITCHHIRDLGEVGRTYLTFNRLIKHHQDVLFQYRPSREGKSLNSEQQCRVPKIIHQVFLADGRNSSLDKYEDALASCHRLHPSWEYRLWTDDNSRSFVRDQYANILPHYKNYRQSIQRANILRYLLLHHYGGVYLDLDITCLTPLDELLHLSWLTPGAYPAGVNNAYVLARKRHTFLVKLLQNVPSRDMRWPMPYVENMLTTGCMFFANRWMRYAWT', 'MLRGKVVTLLLAHGGASSWHEWDDAIILMIGKQYGYFSVIVTVIVAMVIIALRRLLEGRSICRGRRLRSWRGTQEDPNDEERLLINKQG'], 'type': 'mRNA', 'contig': 'JOOL01001072.1', 'strand': '+', 'partialStart': [True]}
Okay the problem is due to overlapping genes inferred FE78DRAFT_540541-T1 and FE78DRAFT_91119-T1 are getting merged in some way because their locations overlap... I thought you could assume every gene/CDS/mRNA has a locus_tag which could be used to group genes instead of locations?
gene 68080..69101
/locus_tag="FE78DRAFT_91119"
mRNA join(68080..68990,69048..69101)
/locus_tag="FE78DRAFT_91119"
/product="hypothetical protein"
CDS 68515..68784
/locus_tag="FE78DRAFT_91119"
/codon_start=1
/product="hypothetical protein"
/protein_id="KXL44899.1"
/db_xref="JGIDB:Aciri1_meta_91119"
/translation="MLRGKVVTLLLAHGGASSWHEWDDAIILMIGKQYGYFSVIVTVI
VAMVIIALRRLLEGRSICRGRRLRSWRGTQEDPNDEERLLINKQG"
gene <67776..68784
/locus_tag="FE78DRAFT_540541"
mRNA <67776..68784
/locus_tag="FE78DRAFT_540541"
/product="glycosyltransferase family 32 protein"
CDS <67776..68438
/locus_tag="FE78DRAFT_540541"
I think this one needs to be updated as well to use the unified parser. It should only be merging the transcripts of the locus_id is the same.
Looks like this function is already updated with latest parser, this shouldn't be causing error. So I downloaded and tried locally and seems to not error on my system:
$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/572/075/GCA_001572075.1_ASM157207v1/GCA_001572075.1_ASM157207v1_genomic.gbff.gz
$ gunzip GCA_001572075.1_ASM157207v1_genomic.gbff.gz
$ funannotate annotate --genbank GCA_001572075.1_ASM157207v1_genomic.gbff -o test
-------------------------------------------------------
[Jul 03 04:51 PM]: OS: MacOSX 10.13.5, 8 cores, ~ 17 GB RAM. Python: 2.7.11
[Jul 03 04:51 PM]: Running funannotate v1.4.0-90e1680
[Jul 03 04:51 PM]: No NCBI SBT file given, will use default, however if you plan to submit to NCBI, create one and pass it here '--sbt'
[Jul 03 04:51 PM]: Output directory test already exists, will use any existing data. If this is not what you want, exit, and provide a unique name for output folder
[Jul 03 04:51 PM]: Checking GenBank file for annotation
[Jul 03 04:51 PM]: Adding Functional Annotation to Acidomyces richmondensis, NCBI accession: None
[Jul 03 04:51 PM]: Annotation consists of: 10,338 gene models
[Jul 03 04:51 PM]: 10,338 protein records loaded
[Jul 03 04:51 PM]: Running HMMer search of PFAM version 31.0
...
Or wait, are you getting an error on the output after the annotation?
Yes this is on the output part.
On Jul 3, 2018, 1:56 PM -0700, Jon Palmer notifications@github.com, wrote:
Or wait, are you getting an error on the output after the annotation? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.
Okay... good news is found the bug. Basically when processing NCBI GBK files the scripts find a protein_id
-- which from NCBI looks something like KXL41355.1 for example - as it represents an ID in their protein database. When you are annotating from "scratch" the funannotated genbank files don't have these protein_ids yet, so it defaults to locus_tag-T1
(example). So what is happening is that the protein FASTA file is being generated and using the protein_id field for the unique name, which then doesn't match up with what is written in the tbl file. So need to get this straightened out -- this could end up being somewhat difficult as I've got several checks built in based off the locus_tag-T1
format, so those need to be adjusted accordingly.
Sure - FYI. My use case is using your comparative pipeline on a mix of funannotate and genbank genomes
Alternatively I was thinking of finishing my separate compare pipeline which would fill in annotations in a format that doesnt depend on genbank and the annotations decorated on that. Could run/rerun iprscan etc so could run from jgi / ensembl CDS and pepfiles (or DNA + gff)
Jason Stajich, PhD jasonstajich.phd@gmail.com On Jul 4, 2018, 3:43 PM -0700, Jon Palmer notifications@github.com, wrote:
Okay... good news is found the bug. Basically when processing NCBI GBK files the scripts find a protein_id -- which from NCBI looks something like KXL41355.1 for example - as it represents an ID in their protein database. When you are annotating from "scratch" the funannotated genbank files don't have these protein_ids yet, so it defaults to locus_tag-T1 (example). So what is happening is that the protein FASTA file is being generated and using the protein_id field for the unique name, which then doesn't match up with what is written in the tbl file. So need to get this straightened out -- this could end up being somewhat difficult as I've got several checks built in based off the locus_tag-T1 format, so those need to be adjusted accordingly. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.
Yeah, that's what I figured - but it should work from genbank file as well. Okay, well what I thought bug was wasn't (although it also needed to be fixed).
$ funannotate annotate --genbank GCA_001572075.1_ASM157207v1_genomic.gbff -o test --cpus 20
-------------------------------------------------------
[Jul 04 04:06 PM]: OS: MacOSX 10.13.5, 24 cores, ~ 67 GB RAM. Python: 2.7.14
[Jul 04 04:06 PM]: Running funannotate v1.4.0-90e1680
[Jul 04 04:06 PM]: No NCBI SBT file given, will use default, however if you plan to submit to NCBI, create one and pass it here '--sbt'
[Jul 04 04:06 PM]: Output directory test already exists, will use any existing data. If this is not what you want, exit, and provide a unique name for output folder
[Jul 04 04:06 PM]: Checking GenBank file for annotation
[Jul 04 04:07 PM]: Adding Functional Annotation to Acidomyces richmondensis, NCBI accession: None
[Jul 04 04:07 PM]: Annotation consists of: 10,338 gene models
[Jul 04 04:07 PM]: 10,338 protein records loaded
[Jul 04 04:07 PM]: Running HMMer search of PFAM version 31.0
[Jul 04 04:10 PM]: 10,839 annotations added
[Jul 04 04:10 PM]: Running Diamond blastp search of UniProt DB version 2018_06
[Jul 04 04:11 PM]: 608 valid gene/product annotations from 867 total
[Jul 04 04:11 PM]: Running Eggnog-mapper
[Jul 04 04:28 PM]: Parsing EggNog Annotations
[Jul 04 04:28 PM]: 16,052 COG and EggNog annotations added
[Jul 04 04:28 PM]: Combining UniProt/EggNog gene and product names using Gene2Product version 1.10
[Jul 04 04:28 PM]: 1,943 gene name and product description annotations added
[Jul 04 04:28 PM]: Running Diamond blastp search of MEROPS version 12.0
[Jul 04 04:29 PM]: 292 annotations added
[Jul 04 04:29 PM]: Annotating CAZYmes using HMMer search of dbCAN version 6.0
[Jul 04 04:29 PM]: 570 annotations added
[Jul 04 04:29 PM]: Annotating proteins with BUSCO dikarya models
[Jul 04 04:30 PM]: 1,261 annotations added
[Jul 04 04:30 PM]: Skipping phobius predictions, try funannotate remote -m phobius
[Jul 04 04:30 PM]: Predicting secreted proteins with SignalP
[Jul 04 04:37 PM]: 635 secretome and 0 transmembane annotations added
[Jul 04 04:37 PM]: InterProScan error, test/annotate_misc/iprscan.xml is empty, or no XML file passed via --iprscan. Functional annotation will be lacking.
[Jul 04 04:37 PM]: Found 33,535 duplicated annotations, adding 0 valid annotations
[Jul 04 04:37 PM]: Converting to final Genbank format, good luck!
Traceback (most recent call last):
File "/usr/local/funannotate/bin/funannotate-functional.py", line 1040, in <module>
lib.gb2allout(final_gbk, final_gff, final_proteins, final_transcripts, final_fasta)
File "/usr/local/funannotate/lib/library.py", line 3057, in gb2allout
#now write mRNA feature
IndexError: list index out of range
I'm still getting the error, which doesn't make any sense, here is the tbl annotation file going into tbl2asn
:
<67776 68784 gene
locus_tag FE78DRAFT_540541
<67776 68784 mRNA
product glycosyltransferase family 32 protein
transcript_id gnl|ncbi|FE78DRAFT_540541-T1_mrna
protein_id gnl|ncbi|FE78DRAFT_540541-T1
<67776 68438 CDS
codon_start 1
product glycosyltransferase family 32 protein
transcript_id gnl|ncbi|FE78DRAFT_540541-T1_mrna
protein_id gnl|ncbi|FE78DRAFT_540541-T1
68080 69101 gene
locus_tag FE78DRAFT_91119
68080 68990 mRNA
69048 69101
product hypothetical protein
transcript_id gnl|ncbi|FE78DRAFT_91119-T1_mrna
protein_id gnl|ncbi|FE78DRAFT_91119-T1
68515 68784 CDS
codon_start 1
product hypothetical protein
transcript_id gnl|ncbi|FE78DRAFT_91119-T1_mrna
protein_id gnl|ncbi|FE78DRAFT_91119-T1
Yet somehow this is what is being written in the GenBank file --> note the locus_tag in the second gene. So the parser sees the locus_tag and puts that CDS into the other gene location and then hence the error...
gene <67776..68784
/locus_tag="FE78DRAFT_540541"
mRNA <67776..68784
/locus_tag="FE78DRAFT_540541"
/product="glycosyltransferase family 32 protein"
CDS <67776..68438
/locus_tag="FE78DRAFT_540541"
/codon_start=1
/product="glycosyltransferase family 32 protein"
/protein_id="ncbi:FE78DRAFT_540541-T1"
/translation="VILITCHHIRDLGEVGRTYLTFNRLIKHHQDVLFQYRPSREGKS
LNSEQQCRVPKIIHQVFLADGRNSSLDKYEDALASCHRLHPSWEYRLWTDDNSRSFVR
DQYANILPHYKNYRQSIQRANILRYLLLHHYGGVYLDLDITCLTPLDELLHLSWLTPG
AYPAGVNNAYVLARKRHTFLVKLLQNVPSRDMRWPMPYVENMLTTGCMFFANRWMRYA
WT"
gene 68080..69101
/locus_tag="FE78DRAFT_91119"
mRNA join(68080..68990,69048..69101)
/locus_tag="FE78DRAFT_91119"
/product="hypothetical protein"
CDS 68515..68784
/locus_tag="FE78DRAFT_540541"
/codon_start=1
/product="hypothetical protein"
/protein_id="ncbi:FE78DRAFT_91119-T1"
/translation="MLRGKVVTLLLAHGGASSWHEWDDAIILMIGKQYGYFSVIVTVI
VAMVIIALRRLLEGRSICRGRRLRSWRGTQEDPNDEERLLINKQG"
The other thing very strange about this annotation is that many/most of these gene models should have failed -- they are listed as partial when they are not, etc.
It almost seems like it is a tbl2asn error - but I've never seen this before. Let me pull a few other NCBI GBK genomes and see if this is repeatable on other genomes. In the meantime, I'll push the fix for the annotations not being added correctly (which is what I thought the error was).
Processing a funannotate annotate on an existing genbank genome
https://www.ncbi.nlm.nih.gov/genome/?term=txid245562[Organism:exp]
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/572/075/GCA_001572075.1_ASM157207v1/GCA_001572075.1_ASM157207v1_genomic.gbff.gz
I get this when the annotate it writing back out the files in the annotate_results folder. I'm working on it but wanted to record this as a bug and then when we get a fix in can refer to this.