nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

NameError: name 'log' is not defined #977

Closed sophie-03 closed 1 year ago

sophie-03 commented 1 year ago

I am trying to use the gff2prot utility to create a proteome file. However I am running into an error.

My command line input is: funannotate util gff2prot -g species.gff -f species.fa

And I am getting the error:

Traceback (most recent call last):
  File "/home/sm4974/.conda/envs/funannotate/bin/funannotate", line 8, in <module>
    sys.exit(main())
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/funannotate/funannotate.py", line 717, in main
    mod.main(arguments)
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/funannotate/utilities/gff2prot.py", line 30, in main
    Genes = lib.gff2dict(args.gff3, args.fasta, Genes)
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/funannotate/library.py", line 5362, in gff2dict
    log.debug(
NameError: name 'log' is not defined

nextgenusfs commented 1 year ago

You must be running an old version?

As an aside, I've pulled out all of the GFF3 parsing abilities, improved them, and they are now packaged in a separate tool called gfftk -- so it should be more suited to handle these types of simple conversions: https://github.com/nextgenusfs/gfftk

sophie-03 commented 1 year ago

I am running version v1.8.16, installed from python -m pip install git+https://github.com/nextgenusfs/funannotate.git

Does gfftk also have a gff2prot utility? I can only see these commands:

Commands:
  consensus   EvidenceModeler-like tool to generate consensus gene predictions.
  convert     convert GFF3/tbl format into another format [output gff3, gtf, tbl, gbff, fasta].
  sort        sort GFF3 file properly [maintain feature order: gene, mrna, exon, cds].
  sanitize    sanitize GFF3 file, load GFF3 and output cleaned up GFF3 output.
  rename      rename gene models in GFF3 annotation file.
  stats       parse annotation GFF3/tbl and output summary statistics.
  compare     compare two GFF3 annotations of a genome.

hyphaltip commented 1 year ago

There are help menus associated with these which will assist.

Use convert - see gfftk convert -h

gfftk convert --input-format gff3 --output-format proteins -i gff3-file -f fasta-genome -o proteins.fa

hyphaltip commented 1 year ago

Can you post the GFF+FASTA that cause the bug to show up? I am checking in one simple fix in funannotate to see if that covers the problem there, but I cannot easily reproduce your bug to test. You can also try installing the live code with python -m pip install git+https://github.com/nextgenusfs/funannotate.git and see if that handles it too.

sophie-03 commented 1 year ago

gfftk convert works great - thank you!

I am also trying to create proteomes for genomes where the GFF file was created by liftoff. This seems to be causing some problems, and I am wondering if it is due to the format of the liftoff GFF, as gfftk convert works perfectly for species where the GFF file is not from liftoff.

This is the error I am getting for liftoff species:

Traceback (most recent call last):
  File "/home/sm4974/.conda/envs/funannotate/bin/gfftk", line 8, in <module>
    sys.exit(main())
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/__main__.py", line 22, in main
    convert(args)
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/convert.py", line 89, in convert
    gff2proteins(
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/convert.py", line 366, in gff2proteins
    Genes = gff2dict(gff, fasta, table=table, debug=debug)
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/gff.py", line 2349, in gff2dict
    annotation = validate_models(
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/gff.py", line 1258, in validate_models
    gene, update = r.result()
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/gff.py", line 1359, in validate_and_translate_models
    if protSeq[-1] == "*":
IndexError: string index out of range

Here is an example liftoff gff file: Addax_nasomaculatus_gff.zip

nextgenusfs commented 1 year ago

Thanks @sophie-03 I will take a look. The parser should be able to handle these, but I recall liftoff maybe having a few slight tweaks to the format.

nextgenusfs commented 1 year ago

@sophie-03 I'm also going to need the FASTA file to replicate this locally. If you'd be able to trim it down to just a few contigs that still cause the error, that would also help. If you don't want to share it on here, you can email it to me: nextgenusfs at gmail dot com.

sophie-03 commented 1 year ago

I will email it over to you as it is too large to upload directly. Thanks for your help!

nextgenusfs commented 1 year ago

This appears to be the offending model (there may be more), but basically it's an incomplete liftover and should be listed as pseudo. I'll see if I can get the code to do that properly.

$ gfftk sanitize -f Addax_nasomaculatus.fa -g Addax_nasomaculatus.gff -o Addax_nasomaculatus.sani.gff3
ERROR in parsing gene gene-C11H17orf50
[('partialStart', [], 0), ('partialStop', [], 0)]
{'name': 'C11H17orf50', 'type': ['mRNA'], 'transcript': ['TCCTGGCCACACACGAAGCATTCCAGGAGCTCCCGCTGGCCGACCAAGAATAGCTGGGATTCCTGGAGAAGGACCCAgctccccagcctccctgtcttCAGTCTCAGCAGGCTTCTGGGCTCTCTAGTCCTGCGCACTCCACACCTGCCAGGGCCTCGTGGGCACACACACTGACCGCGGTCAGAGCGGCCTTCCCGTTGGAGGGGAGGAAGGTCCGGCCAGCTTCAGCCCTCAGCTCCCAGTGGGCATTCCTGGCGCGGAGCCTGCCTTTTTTCTTCTCGGCGCGCGGTGGGGAGAATGGCCTTTAACTGCCCTTGGCTGCAAGGGCTTGGTTCAGCCCGGAGCTGCTGGGAGGAGGCTCTCCGCCTGCCCTTTTCCCTCCGGCTTGAACCCGGGGAAAGGGCAAGGAGGGTTGATCTTTATCACGCTCACCGGGGTGCTCGCTCCTCCTTTAGGCACTGACACTTGGAACCGAGGCTTTATTGGAGGATTTGAGGGGAGGGGCAAGTTTCCCAGAGGAGTTATTTTAGGCGATGCCTCTGCCCACCGGGCCAGAAGGAACCTGAAATACCCGAGACAAGTGAGAGGGCTGTGGGCAGGGGACGGGAGTCAGGTGGGCGATGGGCACCGCGTGCTGACCGGCGAGGCGCTGTTGCAGGACCAGACCTCCAGGTGATGATGTTGGGCTCCGCACTGATAACAGGTGGGGCCGCCCCATCGGGGATGCGCCTGCTGGTCTGCTCCGAGAGGGGGGCGCCCCGGAAGGGCGGCTCTGCGGCACGTGATCGAAGtcggggaagagggaggagggttgCGGGCAGTTGGGGTGGAGGCGAGGATGAAGGGCTACATAGGCAGggagtggtggtgggggtggtggcgCTC'], 'cds_transcript': ['TC'], 'protein': [''], '5UTR': [[]], '3UTR': [[]], 'gene_synonym': [], 'codon_start': [1], 'ids': ['rna-XM_004013198.5'], 'CDS': [[(1206, 1207)]], 'mRNA': [[(323, 1207)]], 'strand': '-', 'EC_number': [[]], 'location': (323, 1207), 'contig': 'JAIEZW010069611.1', 'product': ['chromosome 11 C17orf50 homolog'], 'source': 'Liftoff', 'phase': [0], 'db_xref': [['GeneID:101112334', 'Genbank:XM_004013198.5', 'Genbank:XP_004013247.1']], 'go_terms': [[]], 'note': [[]], 'partialStart': [], 'partialStop': [], 'pseudo': False}

(funannotate2) jon@Jons-MacBook-Pro:~/Downloads$ grep 'gene-C11H17orf50' Addax_nasomaculatus.gff 
JAIEZW010069611.1   Liftoff gene    323 1207    .   -   .   ID=gene-C11H17orf50;Dbxref=GeneID:101112334;Name=C11H17orf50;gbkey=Gene;gene=C11H17orf50;gene_biotype=protein_coding;coverage=0.448;sequence_ID=0.436;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-C11H17orf50_0;partial_mapping=True;low_identity=True
JAIEZW010069611.1   Liftoff mRNA    323 1207    .   -   .   ID=rna-XM_004013198.5;Parent=gene-C11H17orf50;Dbxref=GeneID:101112334,Genbank:XM_004013198.5;Name=XM_004013198.5;gbkey=mRNA;gene=C11H17orf50;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 71%25 coverage of the annotated genomic feature by RNAseq alignments;product=chromosome 11 C17orf50 homolog;transcript_id=XM_004013198.5;matches_ref_protein=False;valid_ORF=False;partial_ORF=True;extra_copy_number=0
JAIEZW010069611.1   Liftoff exon    323 1207    .   -   .   ID=exon-XM_004013198.5-3;Parent=rna-XM_004013198.5;Dbxref=GeneID:101112334,Genbank:XM_004013198.5;gbkey=mRNA;gene=C11H17orf50;product=chromosome 11 C17orf50 homolog;transcript_id=XM_004013198.5;extra_copy_number=0
JAIEZW010069611.1   Liftoff CDS 1206    1207    .   -   .   ID=cds-XP_004013247.1;Parent=rna-XM_004013198.5;Dbxref=GeneID:101112334,Genbank:XP_004013247.1;Name=XP_004013247.1;gbkey=CDS;gene=C11H17orf50;product=uncharacterized protein C17orf50 homolog;protein_id=XP_004013247.1;extra_copy_number=0

It is showing a CDS of just 2 bp (1206-1207), i.e. less than one codon, so the translated protein is empty.....
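
To make the link between this model and the IndexError concrete, here is a minimal sketch in plain Python. The `translate` helper and its toy codon table are hypothetical stand-ins, not GFFtk's code; only the `protSeq[-1] == "*"` style check is taken from the traceback earlier in this thread. The model above has a 2-nt `cds_transcript` (`'TC'`) and an empty `protein`, and indexing an empty string with `[-1]` is what raises.

```python
# Hypothetical stand-in for a translation routine (NOT GFFtk's implementation).
# Only complete codons are translated, so a 2-nt CDS yields an empty string.
CODONS = {"ATG": "M", "GAA": "E", "TGA": "*", "TAA": "*", "TAG": "*"}  # toy subset


def translate(cds: str) -> str:
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(CODONS.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def strip_stop_unsafe(prot_seq: str) -> str:
    # Same shape as the check in the traceback: raises IndexError on "".
    if prot_seq[-1] == "*":
        return prot_seq[:-1]
    return prot_seq


def strip_stop_safe(prot_seq: str) -> str:
    # str.endswith simply returns False for an empty string.
    if prot_seq.endswith("*"):
        return prot_seq[:-1]
    return prot_seq


protein = translate("TC")  # the cds_transcript of the model above
```

With this sketch, `strip_stop_unsafe(protein)` raises `IndexError: string index out of range`, while `strip_stop_safe(protein)` returns the empty string. Whether an empty protein should then be skipped or flagged as pseudo is a separate design decision.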

sophie-03 commented 1 year ago

Thanks so much! That would be great.

sophie-03 commented 1 year ago

How are you getting that error that specifies the gene with the parsing error? I get the same message as previously when trying sanitize:

$ gfftk sanitize -f Addax_nasomaculatus.fa -g Addax_nasomaculatus.gff -o Addax_nasomaculatus.sani.gff                                                                                                     
Traceback (most recent call last):                                                                                                             
  File "/home/sm4974/.conda/envs/funannotate/bin/gfftk", line 8, in <module>                                                                   
    sys.exit(main())                                                                                                                           
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/__main__.py", line 26, in main                                  
    sanitize(args)                                                                                                                             
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/sanitize.py", line 5, in sanitize                               
    Genes = gff2dict(args.gff3, args.fasta, debug=args.debug)                                                                                  
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/gff.py", line 2349, in gff2dict                                 
    annotation = validate_models(                                                                                                              
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/gff.py", line 1258, in validate_models                          
    gene, update = r.result()                                                                                                                  
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/concurrent/futures/_base.py", line 439, in result                                   
    return self.__get_result()                                                                                                                 
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result                             
    raise self._exception                                                                                                                      
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/concurrent/futures/thread.py", line 58, in run                                      
    result = self.fn(*self.args, **self.kwargs)                                                                                                
  File "/home/sm4974/.conda/envs/funannotate/lib/python3.9/site-packages/gfftk/gff.py", line 1359, in validate_and_translate_models           
    if protSeq[-1] == "*":
IndexError: string index out of range

nextgenusfs commented 1 year ago

I'm running some slightly modified code to troubleshoot this error. It seems that is the only bad model, so the quick fix for this dataset is to drop that model like this, and then it should work:

grep -v 'XM_004013198.5' Addax_nasomaculatus.gff | grep -v 'gene-C11H17orf50' > Addax_nasomaculatus_fixed.gff
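
For anyone who prefers to do the same filtering in Python, a rough equivalent is sketched below. The function name `drop_lines_mentioning` and the toy records (contig `ctg1`, `gene-OTHER`, shortened attributes) are made up for illustration. Unlike the `grep` patterns, where `.` matches any character, the strings here are matched literally.

```python
def drop_lines_mentioning(lines, needles):
    """Keep only the lines that contain none of the literal strings in `needles`."""
    return [line for line in lines if not any(n in line for n in needles)]


# Toy tab-separated records modelled on the offending gene in this thread.
example = [
    "##gff-version 3\n",
    "ctg1\tLiftoff\tgene\t323\t1207\t.\t-\t.\tID=gene-C11H17orf50\n",
    "ctg1\tLiftoff\tmRNA\t323\t1207\t.\t-\t.\tID=rna-XM_004013198.5;Parent=gene-C11H17orf50\n",
    "ctg1\tLiftoff\tCDS\t1206\t1207\t.\t-\t.\tID=cds-XP_004013247.1;Parent=rna-XM_004013198.5\n",
    "ctg1\tLiftoff\tgene\t5000\t6000\t.\t+\t.\tID=gene-OTHER\n",
]

# Drops the gene, mRNA and CDS lines; keeps the header and the unrelated gene.
kept = drop_lines_mentioning(example, ["XM_004013198.5", "gene-C11H17orf50"])
```

Note that this is still plain substring matching, so an ID that is a prefix of another ID would remove both; check the dropped lines before overwriting anything.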

I'll need to figure out a set of rules to pick this up, probably something like: if the CDS length is below some small threshold (e.g. 10 amino acids), then it's not a real/valid CDS. I'm still thinking about the consequences of doing something like that. Or I just push the hot fix that allows the user to see the problematic model....
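
In the meantime, a rule like this can be checked outside GFFtk. The sketch below is not GFFtk's logic: the function name `short_cds_parents` and the toy records are hypothetical, and the 10 amino acid default is only the example threshold mentioned above. It assumes a tab-delimited GFF3 in which each CDS row carries a `Parent` attribute, sums the CDS length per parent transcript, and reports the transcripts that fall below the threshold.

```python
from collections import defaultdict


def short_cds_parents(gff_lines, min_aa=10):
    """Return {parent_id: total_cds_nt} for transcripts whose summed CDS
    length is shorter than `min_aa` codons."""
    totals = defaultdict(int)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        start, end = int(cols[3]), int(cols[4])
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        # A CDS row may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                totals[parent] += end - start + 1  # GFF3 coordinates are inclusive
    return {p: n for p, n in totals.items() if n < min_aa * 3}


# Toy records: the 2-nt CDS from this thread plus a made-up 300-nt CDS.
example = [
    "##gff-version 3\n",
    "ctg1\tLiftoff\tCDS\t1206\t1207\t.\t-\t.\tID=cds-XP_004013247.1;Parent=rna-XM_004013198.5\n",
    "ctg1\tLiftoff\tCDS\t5000\t5299\t.\t+\t0\tID=cds-OTHER;Parent=rna-OTHER\n",
]
flagged = short_cds_parents(example)
```

Here `flagged` contains only `rna-XM_004013198.5` with a total of 2 nt. The flagged IDs can then be reviewed by hand or removed from the GFF before conversion; very short but genuine CDS fragments would also be caught, which is the trade-off being weighed above.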

sophie-03 commented 1 year ago

Removing CDS with length <10 seems to solve the problem. Thanks so much for your help!

sophie-03 commented 1 year ago

Hi @nextgenusfs - I am having some problems with other species with liftoff GFF files, even after removing CDS <10. Any chance you could share with me how you identified which genes in the GFF were problematic?

nextgenusfs commented 1 year ago

I pushed the code update last night I think, so if you install the latest from the GFFtk repo it should hopefully capture the errors. Can you open a new issue on GFFtk with these issues so I can track them properly?

sophie-03 commented 1 year ago

Thanks for the update. All is working now - if I have any more issues I'll open an issue in GFFtk. Thanks again.