nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

lib.RepeatBlast error parsing blast-xml #211

Closed: AntoineHo closed this issue 5 years ago

AntoineHo commented 5 years ago

Hello,

I have an issue running funannotate predict:

funannotate predict -i genome.cleaned.sorted.masked.fa -o predict/ --species "MySpecies" --transcript_evidence TSA/allmRNA.evidence.fa --protein_evidence Proteins/proteins.fa --cpus 10 --busco_db metazoa

I have the following output:

-------------------------------------------------------
[07:23 PM]: OS: linux2, 12 cores, ~ 33 GB RAM. Python: 2.7.15
[07:23 PM]: Running funannotate v1.4.2
[07:23 PM]: Augustus training set for MySpecies already exists. To re-train provide unique --augustus_species argument
[07:23 PM]: AUGUSTUS (3.3.1) detected, version seems to be compatible with BRAKER and BUSCO
[07:23 PM]: Loading genome assembly and parsing soft-masked repetitive sequences
[07:23 PM]: Genome loaded: 753 scaffolds; 147,758,537 bp; 3.00% repeats masked
[07:23 PM]: Aligning transcript evidence to genome with minimap2
[07:24 PM]: Found 66,850 alignments, wrote GFF3 and Augustus hints to file
[07:24 PM]: Mapping proteins to genome using Diamond blastx/Exonerate
/home/antoine/tools/EVidenceModeler-1.1.1/EvmUtils/misc/exonerate_gff_to_alignment_gff3.pl
[07:24 PM]: Using 1,675 proteins as queries
[07:24 PM]: Running Diamond pre-filter search
[07:25 PM]: Found 14,857 preliminary alignments
[07:26 PM]: Exonerate finished: found 8,711 alignments
[07:27 PM]: Running GeneMark-ES on assembly
[07:59 PM]: Converting GeneMark GTF file to GFF3
[07:59 PM]: Found 34,189 gene models
[07:59 PM]: Running Augustus gene prediction
[08:21 PM]: Found 29,191 gene models
[08:21 PM]: Pulling out high quality Augustus predictions
[08:21 PM]: Found 7,185 high quality predictions from Augustus (>90% exon evidence)
[08:21 PM]: Summary of gene models passed to EVM (weights):
-------------------------------------------------------
Augustus models (1):    22,006
Genemark models (1):    34,189
HiQ models (5):     7,185
Pasa models (1):    0
Total models:       63,380
-------------------------------------------------------
[08:21 PM]: Setting up EVM partitions
[08:24 PM]: Generating EVM command list
[08:24 PM]: Running EVM commands with 9 CPUs
[08:46 PM]: Combining partitioned EVM outputs
[08:46 PM]: Converting EVM output to GFF3
[08:47 PM]: Collecting all EVM results
[08:47 PM]: 34,055 total gene models from EVM
[08:47 PM]: Generating protein fasta files from 34,055 EVM models
[08:48 PM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
Traceback (most recent call last):
  File "/home/antoine/tools/funannotate-1.4.2/bin/funannotate-predict.py", line 1235, in <module>
    lib.RepeatBlast(EVM_proteins, args.cpus, 1e-10, FUNDB, os.path.join(args.out, 'predict_misc'), Blast_rep_remove)
  File "/home/antoine/tools/funannotate-1.4.2/lib/library.py", line 3472, in RepeatBlast
    for qresult in SearchIO.parse(results, "blast-xml"):
  File "/home/antoine/anaconda3/envs/fun-py27/lib/python2.7/site-packages/Bio/SearchIO/__init__.py", line 308, in parse
    for qresult in generator:
  File "/home/antoine/anaconda3/envs/fun-py27/lib/python2.7/site-packages/Bio/SearchIO/BlastIO/blast_xml.py", line 236, in __iter__
    for qresult in self._parse_qresult():
  File "/home/antoine/anaconda3/envs/fun-py27/lib/python2.7/site-packages/Bio/SearchIO/BlastIO/blast_xml.py", line 287, in _parse_qresult
    for event, qresult_elem in self.xml_iter:
  File "<string>", line 91, in next
cElementTree.ParseError: mismatched tag: line 84, column 4

It seems that one of the packages has an issue. Any ideas?

Thank you :)
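The `ParseError: mismatched tag` in the traceback above means the XML file handed to `SearchIO.parse` is not well-formed. A minimal sketch for confirming that independently of Biopython, using only the standard library, might look like this. The function name `check_xml_well_formed` is hypothetical and is not part of funannotate.

```python
import xml.etree.ElementTree as ET


def check_xml_well_formed(path):
    """Return (True, None) if the file parses as XML, else (False, message).

    Hypothetical helper for illustration; not part of funannotate.
    """
    try:
        # iterparse streams the file, so a large BLAST XML is not loaded at once.
        for _event, elem in ET.iterparse(path, events=("end",)):
            elem.clear()  # drop the contents of elements already seen
    except ET.ParseError as err:
        # The message carries the line and column of the first problem found.
        return False, str(err)
    return True, None
```

Calling it on the XML file written to the `predict_misc` folder should report the same line and column as the traceback if the file itself is the problem.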

nextgenusfs commented 5 years ago

I think you are running a version of diamond that has a bug in its XML output. Search the issues for "diamond xml". (Sorry, I'm on my phone.)
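As a rough sketch of how an installed version could be compared against the releases later reported in this thread as affected (v0.9.17 and v0.9.18), the snippet below parses a version string. It assumes the string contains a dotted `major.minor.patch` number, for example the text printed by the diamond executable. The set of affected versions comes only from this thread and may be incomplete, and both function names are hypothetical.

```python
import re

# Versions reported in this thread as having XML output problems.
# Assumption: this set may be incomplete.
KNOWN_BAD = {(0, 9, 17), (0, 9, 18)}


def parse_diamond_version(text):
    """Extract the first major.minor.patch number from a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in: %r" % text)
    return tuple(int(part) for part in match.groups())


def has_known_xml_bug(text):
    """Return True if the version in `text` is one of the reported ones."""
    return parse_diamond_version(text) in KNOWN_BAD
```

For example, `has_known_xml_bug("diamond version 0.9.17")` is true, while the same call with `0.9.22` is false.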

AntoineHo commented 5 years ago

I checked my Diamond version, and indeed v0.9.17 and v0.9.18 had issues with the XML format. After updating to v0.9.22, I now get the following error:

[10:14 AM]: Summary of gene models passed to EVM (weights):
-------------------------------------------------------
Augustus models (1):    22,006
Genemark models (1):    34,100
HiQ models (5):     7,185
Pasa models (1):    0
Total models:       63,291
-------------------------------------------------------
[10:14 AM]: Setting up EVM partitions
[10:16 AM]: Generating EVM command list
[10:16 AM]: Running EVM commands with 9 CPUs
[10:38 AM]: Combining partitioned EVM outputs
[10:38 AM]: Converting EVM output to GFF3
[10:39 AM]: Collecting all EVM results
[10:39 AM]: 33,939 total gene models from EVM
[10:39 AM]: Generating protein fasta files from 33,939 EVM models
[10:40 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
input = predict/predict_misc/evm.round1.proteins.fa
DataBase = /media/antoine/Data2/funannotate-DB
Traceback (most recent call last):
  File "/home/antoine/tools/funannotate-1.4.2/bin/funannotate-predict.py", line 1235, in <module>
    lib.RepeatBlast(EVM_proteins, args.cpus, 1e-10, FUNDB, os.path.join(args.out, 'predict_misc'), Blast_rep_remove)
  File "/home/antoine/tools/funannotate-1.4.2/lib/library.py", line 3471, in RepeatBlast
    with open(blast_tmp, 'rU') as results:
IOError: [Errno 2] No such file or directory: 'predict/predict_misc/repeats.xml'

Should I have this file in the folder before starting funannotate?

Thank you

nextgenusfs commented 5 years ago

What is the output of funannotate database? If the database is properly installed, then perhaps just delete the blast output and the evm.round1.proteins.fa files from the predict_misc folder and try again. The logfile should also give you more information, including the command that is being issued. You can try running the failing command yourself to see whether it gives you more information.

The other possibility is that the diamond database needs to be recreated with the new version of diamond. To do that, rerun the funannotate setup command using the --force flag.
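The `IOError` above shows that `repeats.xml` did not exist when the script tried to open it, presumably because the search step that should write it had failed earlier. A guard of the following kind, run before parsing, would surface that sooner with a clearer message. This is only a sketch: `require_output` is a hypothetical helper, not funannotate code.

```python
import os


def require_output(path, producer):
    """Return `path` if it exists and is non-empty, else raise IOError.

    `producer` is a free-text name of the step expected to write the file;
    it is only used in the error message. Hypothetical helper for illustration.
    """
    if not os.path.isfile(path):
        raise IOError(
            "%s did not produce %s; check the logfile for the failing command"
            % (producer, path)
        )
    if os.path.getsize(path) == 0:
        raise IOError("%s produced an empty file: %s" % (producer, path))
    return path
```

A call such as `require_output('predict/predict_misc/repeats.xml', 'diamond search')` would then fail with a message that points at the upstream step instead of at the `open` call.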

AntoineHo commented 5 years ago

Hello, it was indeed a Diamond database problem. I updated it and the problem was gone. Thank you.