nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
320 stars 85 forks source link

Problem using funannotate update at parsing genbank files #729

Closed JC-therea closed 2 years ago

JC-therea commented 2 years ago

Hello, I am using funannotate update to have a more accurate UTR regions from a model organism using dRNA reads. It seems that does have a problem parsing the GenBank files that funannotate creates:

Command:

funannotate update -f $GENOMEFIX -g $GFF -o Modular \
--nanopore_mrna $dRNA --no_trimmomatic --pasa_db mysql \
--stranded F --jaccard_clip --species "Schizosaccharomyces pombe" \
--cpus 8 --no-progress

output:

[Jun 03 04:36 PM]: OS: CentOS Linux 7, 20 cores, ~ 196 GB RAM. Python: 3.7.12
[Jun 03 04:36 PM]: Running 1.8.9
[Jun 03 04:36 PM]: No NCBI SBT file given, will use default, for NCBI submissions pass one here '--sbt'
[Jun 03 04:36 PM]: Previous annotation consists of: 5,132 protein coding gene models and 7,592 non-coding gene models
[Jun 03 04:36 PM]: Trimmomatic will be skipped
[Jun 03 04:36 PM]: Processing long reads: converting to fasta and running SeqClean
[Jun 03 04:37 PM]: Aligning long reads to genome with minimap2
[Jun 03 04:38 PM]: Adding 249,810 unique long-reads to Trinity assemblies
[Jun 03 04:39 PM]: Converting transcript alignments to GFF3 format
[Jun 03 04:40 PM]: Running PASA alignment step using 249,998 transcripts
[Jun 03 11:04 PM]: Running PASA annotation comparison step 1
[Jun 03 11:23 PM]: Running PASA annotation comparison step 2
[Jun 03 11:44 PM]: Generating relative expression values to PASA transcripts
[Jun 03 11:45 PM]: Parsing Kallisto results. Keeping alt-splicing transcripts if expressed at least 10.0% of highest transcript per locus.
[Jun 03 11:45 PM]: Wrote 5,619 transcripts derived from 5,132 protein coding loci.
[Jun 03 11:45 PM]: Validating gene models (renaming, checking translations, filtering, etc)
[Jun 03 11:45 PM]: Writing 6,832 loci to TBL format: dropped 0 overlapping, 20 too short, and 5 frameshift gene models
[Jun 03 11:45 PM]: Converting to Genbank format
[Jun 03 11:46 PM]: Collecting final annotation files
[Jun 03 11:46 PM]: Parsing GenBank files...comparing annotation
-------------------------------------------------------
Traceback (most recent call last):
  File "/soft/EB_repo/devel/programs/noarch/miniconda3/2022-05/envs/funannotate/bin/funannotate", line 10, in <module>
    sys.exit(main())
  File "/soft/EB_repo/devel/programs/noarch/miniconda3/2022-05/envs/funannotate/lib/python3.7/site-packages/funannotate/funannotate.py", line 705, in main
    mod.main(arguments)
  File "/soft/EB_repo/devel/programs/noarch/miniconda3/2022-05/envs/funannotate/lib/python3.7/site-packages/funannotate/update.py", line 2430, in main
    compareAnnotations2(args.gff, final_gbk, Changes, args=args)
  File "/soft/EB_repo/devel/programs/noarch/miniconda3/2022-05/envs/funannotate/lib/python3.7/site-packages/funannotate/update.py", line 1421, in compareAnnotations2
    newGenes[gene[2]]['mRNA'], hitInfo['mRNA'])
  File "/soft/EB_repo/devel/programs/noarch/miniconda3/2022-05/envs/funannotate/lib/python3.7/site-packages/funannotate/update.py", line 1559, in pairwiseAED
    splitAED = [pAED[i:i+len(query)] for i in range(0, len(pAED), len(query))]
ValueError: range() arg 3 must not be zero

The final lines of the logFile provide very similar information:

[06/03/22 16:40:06]: Running PASA alignment step using 249,998 transcripts
[06/03/22 16:40:06]: /soft/EB_repo/devel/programs/noarch/miniconda3/2022-05/envs/funannotate/opt/pasa-2.4.1/Launch_PASA_pipeline.pl -c /users/genomics/jmontanes/Funannotate/Modular/update_misc/pasa/alignAssembly.txt -r -C -R -g /users/genomics/jmontanes/Funannotate/Modular/update_misc/genome.fa --IMPORT_CUSTOM_ALIGNMENTS /users/genomics/jmontanes/Funannotate/Modular/update_misc/transcript.alignments.gff3 -T -t /users/genomics/jmontanes/Funannotate/Modular/update_misc/long-reads.fasta.clean -u /users/genomics/jmontanes/Funannotate/Modular/update_misc/long-reads.fasta --stringent_alignment_overlap 30.0 --TRANSDECODER --MAX_INTRON_LENGTH 3000 --CPU 8 --ALIGNERS blat --transcribed_is_aligned_orient
[06/03/22 23:04:49]: Running PASA annotation comparison step 1
[06/03/22 23:04:49]: /soft/EB_repo/devel/programs/noarch/miniconda3/2022-05/envs/funannotate/opt/pasa-2.4.1/Launch_PASA_pipeline.pl -c /users/genomics/jmontanes/Funannotate/Modular/update_misc/pasa/annotCompare.txt -g /users/genomics/jmontanes/Funannotate/Modular/update_misc/genome.fa -t /users/genomics/jmontanes/Funannotate/Modular/update_misc/long-reads.fasta.clean -A -L --CPU 8 --annots /users/genomics/jmontanes/Funannotate/Modular/update_misc/genome.gff3
[06/03/22 23:23:35]: Running PASA annotation comparison step 2
[06/03/22 23:23:35]: /soft/EB_repo/devel/programs/noarch/miniconda3/2022-05/envs/funannotate/opt/pasa-2.4.1/Launch_PASA_pipeline.pl -c /users/genomics/jmontanes/Funannotate/Modular/update_misc/pasa/annotCompare.txt -g /users/genomics/jmontanes/Funannotate/Modular/update_misc/genome.fa -t /users/genomics/jmontanes/Funannotate/Modular/update_misc/long-reads.fasta.clean -A -L --CPU 8 --annots /users/genomics/jmontanes/Funannotate/Modular/update_misc/pasa/Schizosaccharomyces_pombe_pasa.gene_structures_post_PASA_updates.225186.gff3
[06/03/22 23:44:58]: copying final PASA GFF3 to output: Modular/update_misc/pasa/Schizosaccharomyces_pombe_pasa.gene_structures_post_PASA_updates.337506.gff3
[06/03/22 23:44:58]: Generating relative expression values to PASA transcripts
[06/03/22 23:44:58]: /soft/EB_repo/devel/programs/noarch/miniconda3/2022-05/envs/funannotate/opt/pasa-2.4.1/misc_utilities/gff3_file_to_proteins.pl Modular/update_misc/pasa_final.gff3 Modular/update_misc/genome.fa cDNA
[06/03/22 23:45:03]: minimap2 -axmap-ont -t 8 --secondary=no Modular/update_misc/transcripts.fa Modular/update_misc/long-reads.fasta.clean | samtools sort --reference Modular/update_misc/transcripts.fa -@ 2 -o Modular/update_misc/long-reads_transcripts.bam -
[06/03/22 23:45:31]: Parsing Kallisto results. Keeping alt-splicing transcripts if expressed at least 10.0% of highest transcript per locus.
[06/03/22 23:45:31]: Wrote 5,619 transcripts derived from 5,132 protein coding loci.
[06/03/22 23:45:31]: bedtools intersect -sorted -v -a /users/genomics/jmontanes/Funannotate/Modular/update_misc/genome.trna.gff3.sorted.gff3 -b /users/genomics/jmontanes/Funannotate/Modular/update_misc/bestmodels.gff3.sorted.gff3
[06/03/22 23:45:35]: Validating gene models (renaming, checking translations, filtering, etc)
[06/03/22 23:45:37]: Writing 6,832 loci to TBL format: dropped 0 overlapping, 20 too short, and 5 frameshift gene models
[06/03/22 23:45:37]: Converting to Genbank format
[06/03/22 23:45:37]: /soft/EB_repo/devel/programs/noarch/miniconda3/2022-05/envs/funannotate/bin/python /soft/EB_repo/devel/programs/noarch/miniconda3/2022-05/envs/funannotate/lib/python3.7/site-packages/funannotate/aux_scripts/tbl2asn_parallel.py -i Modular/update_misc/tbl2asn/genome.tbl -f Modular/update_misc/genome.fa -o Modular/update_misc/tbl2asn --sbt /soft/EB_repo/devel/programs/noarch/miniconda3/2022-05/envs/funannotate/lib/python3.7/site-packages/funannotate/config/test.sbt -d Modular/update_results/Schizosaccharomyces_pombe.discrepency.report.txt -s Schizosaccharomyces pombe -t -l paired-ends -v 1 -c 8
[06/03/22 23:46:11]: Collecting final annotation files
[06/03/22 23:46:19]: Parsing GenBank files...comparing annotation

I will appreciate any help. O observed a similar question here but in my case, there is not any gene indicated.

Thank you very much for your help.

nextgenusfs commented 2 years ago

It is most likely failing on trying to compare your original GFF annotation and the GBK "updated" annotation. The first GFF3 parser likely skipped a bunch of annotations it didn't understand, but perhaps one or many of those got through to the compareannotations2 function, ie:

[Jun 03 04:36 PM]: Previous annotation consists of: 5,132 protein coding gene models and 7,592 non-coding gene models

So I'm wondering if there is something off with the GFF3 file here, ie 7500 non-coding gene models???

JC-therea commented 2 years ago

Yeah... my model has a lot (I don't know if all of them are correct) of ncRNAs... Maybe if I want to improve the UTR annotation the best way is just to remove the ncRNAs from the original GFF file and keep only the mRNAs. What do you think?

Thank you for your help

JC-therea commented 2 years ago

Finally, I observed that features like ncRNA or tRNA have the field "CDS" instead of "exon". That was probably the problem with my annotation. Thank you very much!

nextgenusfs commented 2 years ago

Glad you got it figured out! GFF is finicky format.