Status: Closed (sophie-03 closed this issue 1 year ago)
You must be running an old version?
As an aside, I've pulled out all of the GFF3 parsing abilities, improved them, and packaged them in a separate tool called gfftk, so it should be better suited to these types of simple conversions: https://github.com/nextgenusfs/gfftk
I am running version v1.8.16, installed from python -m pip install git+https://github.com/nextgenusfs/funannotate.git
Does gfftk also have a gff2prot utility? I can only see these commands:
Commands:
  consensus   EvidenceModeler-like tool to generate consensus gene predictions.
  convert     convert GFF3/tbl format into another format [output gff3, gtf, tbl, gbff, fasta].
  sort        sort GFF3 file properly [maintain feature order: gene, mrna, exon, cds].
  sanitize    sanitize GFF3 file, load GFF3 and output cleaned up GFF3 output.
  rename      rename gene models in GFF3 annotation file.
  stats       parse annotation GFF3/tbl and output summary statistics.
  compare     compare two GFF3 annotations of a genome.
There are help menus associated with these which will assist.
Use convert - see gfftk convert -h
gfftk convert --input-format gff3 --output-format proteins -i gff3-file -f fasta-genome -o proteins.fa
Can you post the GFF + FASTA that cause the bug to show up? I am checking in one simple fix in funannotate to see if that covers the problem there, but I cannot easily reproduce your bug to test it. You can also try installing the live code with python -m pip install git+https://github.com/nextgenusfs/funannotate.git
and see if that handles it too.
gfftk convert works great - thank you!
I am also trying to create proteomes for genomes where the GFF file was created by liftoff. This seems to cause problems, and I suspect the format of the liftoff GFF is responsible, since gfftk convert works perfectly for species where the GFF file is not from liftoff.
This is the error I am getting for liftoff species:
Traceback (most recent call last):
File "/home/sm4974/.conda/envs/funannotate/bin/gfftk", line 8, in <module>
sys.exit(main())
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/__main__.py", line 22, in main
convert(args)
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/convert.py", line 89, in convert
gff2proteins(
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/convert.py", line 366, in gff2proteins
Genes = gff2dict(gff, fasta, table=table, debug=debug)
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/gff.py", line 2349, in gff2dict
annotation = validate_models(
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/gff.py", line 1258, in validate_models
gene, update = r.result()
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/gff.py", line 1359, in validate_and_translate_models
if protSeq[-1] == "*":
IndexError: string index out of range
Here is an example liftoff gff file: Addax_nasomaculatus_gff.zip
Thanks @sophie-03 I will take a look. The parser should be able to handle these, but I recall liftoff maybe having a few slight tweaks to the format.
@sophie-03 I'm also going to need the FASTA file to replicate this locally. If you could trim it down to just a few contigs that still cause the error, that would also help. If you don't want to share it here, you can email it to me: nextgenusfs at gmail dot com.
I will email it over to you as it is too large to upload directly. Thanks for your help!
This appears to be the offending model (there may be more). Basically it's an incomplete liftover and should be listed as pseudo. I'll see if I can get the code to do that properly.
$ gfftk sanitize -f Addax_nasomaculatus.fa -g Addax_nasomaculatus.gff -o Addax_nasomaculatus.sani.gff3
ERROR in parsing gene gene-C11H17orf50
[('partialStart', [], 0), ('partialStop', [], 0)]
{'name': 'C11H17orf50', 'type': ['mRNA'], 'transcript': ['TCCTGGCCACACACGAAGCATTCCAGGAGCTCCCGCTGGCCGACCAAGAATAGCTGGGATTCCTGGAGAAGGACCCAgctccccagcctccctgtcttCAGTCTCAGCAGGCTTCTGGGCTCTCTAGTCCTGCGCACTCCACACCTGCCAGGGCCTCGTGGGCACACACACTGACCGCGGTCAGAGCGGCCTTCCCGTTGGAGGGGAGGAAGGTCCGGCCAGCTTCAGCCCTCAGCTCCCAGTGGGCATTCCTGGCGCGGAGCCTGCCTTTTTTCTTCTCGGCGCGCGGTGGGGAGAATGGCCTTTAACTGCCCTTGGCTGCAAGGGCTTGGTTCAGCCCGGAGCTGCTGGGAGGAGGCTCTCCGCCTGCCCTTTTCCCTCCGGCTTGAACCCGGGGAAAGGGCAAGGAGGGTTGATCTTTATCACGCTCACCGGGGTGCTCGCTCCTCCTTTAGGCACTGACACTTGGAACCGAGGCTTTATTGGAGGATTTGAGGGGAGGGGCAAGTTTCCCAGAGGAGTTATTTTAGGCGATGCCTCTGCCCACCGGGCCAGAAGGAACCTGAAATACCCGAGACAAGTGAGAGGGCTGTGGGCAGGGGACGGGAGTCAGGTGGGCGATGGGCACCGCGTGCTGACCGGCGAGGCGCTGTTGCAGGACCAGACCTCCAGGTGATGATGTTGGGCTCCGCACTGATAACAGGTGGGGCCGCCCCATCGGGGATGCGCCTGCTGGTCTGCTCCGAGAGGGGGGCGCCCCGGAAGGGCGGCTCTGCGGCACGTGATCGAAGtcggggaagagggaggagggttgCGGGCAGTTGGGGTGGAGGCGAGGATGAAGGGCTACATAGGCAGggagtggtggtgggggtggtggcgCTC'], 'cds_transcript': ['TC'], 'protein': [''], '5UTR': [[]], '3UTR': [[]], 'gene_synonym': [], 'codon_start': [1], 'ids': ['rna-XM_004013198.5'], 'CDS': [[(1206, 1207)]], 'mRNA': [[(323, 1207)]], 'strand': '-', 'EC_number': [[]], 'location': (323, 1207), 'contig': 'JAIEZW010069611.1', 'product': ['chromosome 11 C17orf50 homolog'], 'source': 'Liftoff', 'phase': [0], 'db_xref': [['GeneID:101112334', 'Genbank:XM_004013198.5', 'Genbank:XP_004013247.1']], 'go_terms': [[]], 'note': [[]], 'partialStart': [], 'partialStop': [], 'pseudo': False}
(funannotate2) jon@Jons-MacBook-Pro:~/Downloads$ grep 'gene-C11H17orf50' Addax_nasomaculatus.gff
JAIEZW010069611.1 Liftoff gene 323 1207 . - . ID=gene-C11H17orf50;Dbxref=GeneID:101112334;Name=C11H17orf50;gbkey=Gene;gene=C11H17orf50;gene_biotype=protein_coding;coverage=0.448;sequence_ID=0.436;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-C11H17orf50_0;partial_mapping=True;low_identity=True
JAIEZW010069611.1 Liftoff mRNA 323 1207 . - . ID=rna-XM_004013198.5;Parent=gene-C11H17orf50;Dbxref=GeneID:101112334,Genbank:XM_004013198.5;Name=XM_004013198.5;gbkey=mRNA;gene=C11H17orf50;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 71%25 coverage of the annotated genomic feature by RNAseq alignments;product=chromosome 11 C17orf50 homolog;transcript_id=XM_004013198.5;matches_ref_protein=False;valid_ORF=False;partial_ORF=True;extra_copy_number=0
JAIEZW010069611.1 Liftoff exon 323 1207 . - . ID=exon-XM_004013198.5-3;Parent=rna-XM_004013198.5;Dbxref=GeneID:101112334,Genbank:XM_004013198.5;gbkey=mRNA;gene=C11H17orf50;product=chromosome 11 C17orf50 homolog;transcript_id=XM_004013198.5;extra_copy_number=0
JAIEZW010069611.1 Liftoff CDS 1206 1207 . - . ID=cds-XP_004013247.1;Parent=rna-XM_004013198.5;Dbxref=GeneID:101112334,Genbank:XP_004013247.1;Name=XP_004013247.1;gbkey=CDS;gene=C11H17orf50;product=uncharacterized protein C17orf50 homolog;protein_id=XP_004013247.1;extra_copy_number=0
It is showing a CDS of only 2 bases (1206 to 1207), which is less than a single codon, so the translated protein is an empty string.
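To illustrate why such a short CDS ends in the IndexError from the traceback, here is a minimal sketch. It is not gfftk's code: `translate` and `strip_stop` are hypothetical helpers, and the codon table is a tiny subset chosen only for the demonstration. The point is that a CDS shorter than one codon translates to an empty string, and indexing `[-1]` on an empty string raises, whereas `str.endswith` does not.

```python
# Hypothetical stand-in for a translation step; NOT gfftk's implementation.
# The codon table is a tiny illustrative subset, unknown codons become "X".
CODON_TABLE = {"ATG": "M", "GAA": "E", "TAA": "*"}


def translate(cds):
    """Translate complete codons only; a trailing 1-2 bases are ignored."""
    usable = len(cds) - len(cds) % 3
    return "".join(
        CODON_TABLE.get(cds[i:i + 3].upper(), "X") for i in range(0, usable, 3)
    )


def strip_stop(protein):
    """Remove a trailing stop codon; endswith() is safe on an empty string."""
    return protein[:-1] if protein.endswith("*") else protein


protein = translate("TC")  # the 2-base cds_transcript reported above
try:
    protein[-1] == "*"  # same style of check as in the traceback
except IndexError as err:
    print("unguarded check fails:", err)

print(repr(strip_stop(protein)))  # guarded check returns the empty string
```

A guard like this only avoids the crash. Whether such a model should be dropped, flagged as pseudo, or reported is a separate decision.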
Thanks so much! That would be great.
How are you getting that error that specifies the gene with the parsing error? I get the same message as previously when trying sanitize:
$ gfftk sanitize -f Addax_nasomaculatus.fa -g Addax_nasomaculatus.gff -o Addax_nasomaculatus.sani.gff
Traceback (most recent call last):
File "/home/sm4974/.conda/envs/funannotate/bin/gfftk", line 8, in <module>
sys.exit(main())
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/__main__.py", line 26, in main
sanitize(args)
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/sanitize.py", line 5, in sanitize
Genes = gff2dict(args.gff3, args.fasta, debug=args.debug)
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/gff.py", line 2349, in gff2dict
annotation = validate_models(
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/gff.py", line 1258, in validate_models
gene, update = r.result()
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/gff.py", line 1359, in validate_and_translate_models
if protSeq[-1] == "*":
IndexError: string index out of range
I'm running some slightly modified code to troubleshoot this error. It seems that this is the only bad model, so the quick fix for this dataset is to drop that model like this, and then it should work:
grep -v 'XM_004013198.5' Addax_nasomaculatus.gff | grep -v 'gene-C11H17orf50' > Addax_nasomaculatus_fixed.gff
I'll need to figure out a set of rules to pick this up, probably something like: if the CDS length is below some small threshold (e.g. 10 amino acids), then it's not a real/valid CDS. I'm still thinking about the consequences of doing something like that. Or I just push the hot fix that allows the user to see the problematic model...
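As a rough sketch of the kind of rule described above (not the fix that was eventually pushed to gfftk), the following sums CDS lengths per parent transcript in a tab-delimited GFF3 and reports transcripts whose total CDS is shorter than a threshold. The function names and the 10 amino acid default are assumptions for illustration only.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def short_cds_transcripts(gff_lines, min_aa=10):
    """Return {transcript_id: total CDS length in bases} for transcripts
    whose summed CDS is shorter than min_aa codons (hypothetical rule)."""
    cds_len = defaultdict(int)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        start, end = int(cols[3]), int(cols[4])
        # A CDS feature may list several comma-separated parents.
        for parent in parse_attributes(cols[8]).get("Parent", "").split(","):
            if parent:
                cds_len[parent] += end - start + 1
    return {tid: n for tid, n in cds_len.items() if n < min_aa * 3}


# The CDS line from the thread, abbreviated to the attributes used here.
example = "\t".join([
    "JAIEZW010069611.1", "Liftoff", "CDS", "1206", "1207", ".", "-", ".",
    "ID=cds-XP_004013247.1;Parent=rna-XM_004013198.5",
])
flagged = short_cds_transcripts([example])
print(flagged)  # {'rna-XM_004013198.5': 2}
```

To scan a whole file, pass an open file handle as `gff_lines`. Transcripts with no CDS features at all are not reported, so non-coding models are left alone.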
Removing CDS with length <10 seems to solve the problem. Thanks so much for your help!
Hi @nextgenusfs - I am having some problems with other species with liftoff gff files - even after removing CDS <10, any chance you could share with me how you identified which genes in the gff were problematic?
I pushed the code update last night, I think, so if you install the latest version from the GFFtk repo it should hopefully capture the errors. Can you open a new issue on GFFtk for these problems so I can track them properly?
Thanks for the update. All is working now - if I have any more issues I'll open an issue in GFFtk. Thanks again.
I am trying to use the gff2prot utility to create a proteome file. However I am running into an error.
My command line input is:
funannotate util gff2prot -g species.gff -f species.fa
And I am getting the error:
Traceback (most recent call last):
File "/home/sm4974/.conda/envs/funannotate/bin/funannotate", line 8, in <module>
sys.exit(main())
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/funannotate/funannotate.py", line 717, in main
mod.main(arguments)
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/funannotate/utilities/gff2prot.py", line 30, in main
Genes = lib.gff2dict(args.gff3, args.fasta, Genes)
File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/funannotate/library.py", line 5362, in gff2dict
log.debug(
NameError: name 'log' is not defined