nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
300 stars 82 forks source link

No GBK file produced in update step #989

Open tania-k opened 6 months ago

tania-k commented 6 months ago

Funannotate version = funannotate v1.8.16

Hello Funannotate folks, Thanks for producing a wonderful program! I am currently experiencing an issue that I am unsure how to solve. I have progressed through the steps in Funannotate, using mask, train, predict, update, antismash, and iprscan, and am now in the final annotate step but I am receiving an issue running it.

The way I am running the script is: With a lot of variables.

funannotate annotate --sbt $TEMPLATE --busco_db $BUSCO -i $OUTDIR/$BASE --species "$SPECIES" --strain "$STRAIN" --cpus $CPUS $MOREFEATURE $EXTRAANNOT `

The error appears as Histoplasma_capsulatum_1371NJ

[Dec 11 02:46 PM]: OS: Red Hat Enterprise Linux 8.8, 256 cores, ~ 792 GB RAM. Python: 3.8.15
[Dec 11 02:46 PM]: Running 1.8.16
[Dec 11 02:46 PM]: Parsing input files
[Dec 11 02:46 PM]: Existing tbl found: annotate/Histoplasma_capsulatum_1371NJ/update_results/Histoplasma_capsulatum_1371NJ.tbl
[Dec 11 02:46 PM]: Protein FASTA file not found, exiting
-------------------------------------------------------

As I look through my other runs, I noticed that my update step ran just fine in the log files, but it did not produce all the files it usually does (specifically gbk file). I understand that to run this step it would be pulling from the update_results folder. I had the antismash step run pull from the predict_results folder instead as the gbk file was not produced, but I want to understand why, and make sure the 5' and 3' UTRs are detected here by re-running my analysis.

(funannotate) [taniak@longleaf-login5 Annotation]$ ll annotate/Histoplasma_capsulatum_1371NJ/update_results/
total 26898
-rw-r--r-- 1 taniak rc_matutelb_psx  3230200 Nov 24 16:41 Histoplasma_capsulatum_1371NJ.cds-transcripts.fa
-rw-r--r-- 1 taniak rc_matutelb_psx        0 Nov 24 16:40 Histoplasma_capsulatum_1371NJ.discrepency.report.txt
-rw-r--r-- 1 taniak rc_matutelb_psx        0 Nov 24 16:40 Histoplasma_capsulatum_1371NJ.error.summary.txt
-rw-r--r-- 1 taniak rc_matutelb_psx        0 Nov 24 16:40 Histoplasma_capsulatum_1371NJ.gbk
-rw-r--r-- 1 taniak rc_matutelb_psx  2241607 Nov 24 16:40 Histoplasma_capsulatum_1371NJ.gff3
-rw-r--r-- 1 taniak rc_matutelb_psx  3869464 Nov 24 16:41 Histoplasma_capsulatum_1371NJ.mrna-transcripts.fa
-rw-r--r-- 1 taniak rc_matutelb_psx   372076 Nov 24 16:41 Histoplasma_capsulatum_1371NJ.pasa-reannotation.changes.txt
-rw-r--r-- 1 taniak rc_matutelb_psx  1151423 Nov 24 16:41 Histoplasma_capsulatum_1371NJ.proteins.fa
-rw-r--r-- 1 taniak rc_matutelb_psx 15305820 Nov 24 16:41 Histoplasma_capsulatum_1371NJ.scaffolds.fa
-rw-r--r-- 1 taniak rc_matutelb_psx     1614 Nov 24 16:41 Histoplasma_capsulatum_1371NJ.stats.json
-rw-r--r-- 1 taniak rc_matutelb_psx  1368571 Nov 24 16:40 Histoplasma_capsulatum_1371NJ.tbl
-rw-r--r-- 1 taniak rc_matutelb_psx        0 Nov 24 16:40 Histoplasma_capsulatum_1371NJ.validation.txt
-rw-r--r-- 1 taniak rc_matutelb_psx        5 Nov 24 16:41 WGS_accession.txt

The GBK file is empty while the rest of my files are there. Looking at my update log file.

sample is Histoplasma_capsulatum_1371NJ
[Nov 24 04:40 PM]: OS: Red Hat Enterprise Linux 8.8, 72 cores, ~ 791 GB RAM. Python: 3.7.12
[Nov 24 04:40 PM]: Running 1.8.16
[Nov 24 04:40 PM]: Found relevant files in annotate/Histoplasma_capsulatum_1371NJ/training, will re-use them:
    GFF3: annotate/Histoplasma_capsulatum_1371NJ/predict_results/Histoplasma_capsulatum_1371NJ.gff3
    Genome: annotate/Histoplasma_capsulatum_1371NJ/predict_results/Histoplasma_capsulatum_1371NJ.scaffolds.fa
    Forward reads: annotate/Histoplasma_capsulatum_1371NJ/training/left.fq.gz
    Reverse reads: annotate/Histoplasma_capsulatum_1371NJ/training/right.fq.gz
    Forward Q-trimmed reads: annotate/Histoplasma_capsulatum_1371NJ/training/trimmomatic/trimmed_left.fastq.gz
    Reverse Q-trimmed reads: annotate/Histoplasma_capsulatum_1371NJ/training/trimmomatic/trimmed_right.fastq.gz
    Forward normalized reads: annotate/Histoplasma_capsulatum_1371NJ/training/normalize/left.norm.fq
    Reverse normalized reads: annotate/Histoplasma_capsulatum_1371NJ/training/normalize/right.norm.fq
    Trinity results: annotate/Histoplasma_capsulatum_1371NJ/training/funannotate_train.trinity-GG.fasta
    PASA config file: annotate/Histoplasma_capsulatum_1371NJ/training/pasa/alignAssembly.txt
    BAM alignments: annotate/Histoplasma_capsulatum_1371NJ/training/funannotate_train.coordSorted.bam
    StringTie GTF: annotate/Histoplasma_capsulatum_1371NJ/training/funannotate_train.stringtie.gtf
[Nov 24 04:40 PM]: Reannotating Histoplasma_capsulatum, NCBI accession: None
[Nov 24 04:40 PM]: Previous annotation consists of: 3,425 protein coding gene models and 24 non-coding gene models
[Nov 24 04:40 PM]: Existing annotation: locustag=1371NJ_ genenumber=3449
[Nov 24 04:40 PM]: Existing BAM alignments found: annotate/Histoplasma_capsulatum_1371NJ/update_misc/trinity.alignments.bam, annotate/Histoplasma_capsulatum_1371NJ/update_misc/transcript.alignments.bam
[Nov 24 04:40 PM]: Skipping PASA, found existing output: annotate/Histoplasma_capsulatum_1371NJ/update_misc/pasa_final.gff3
[Nov 24 04:40 PM]: Existing Kallisto output found: annotate/Histoplasma_capsulatum_1371NJ/update_misc/kallisto.tsv
[Nov 24 04:40 PM]: Parsing Kallisto results. Keeping alt-splicing transcripts if expressed at least 10.0% of highest transcript per locus.
[Nov 24 04:40 PM]: Wrote 3,587 transcripts derived from 3,433 protein coding loci.
[Nov 24 04:40 PM]: Validating gene models (renaming, checking translations, filtering, etc)
[Nov 24 04:40 PM]: Writing 3,455 loci to TBL format: dropped 0 overlapping, 0 too short, and 0 frameshift gene models
[Nov 24 04:40 PM]: Converting to Genbank format
[Nov 24 04:40 PM]: Collecting final annotation files
[Nov 24 04:41 PM]: Comparing original annotation to updated
 original: annotate/Histoplasma_capsulatum_1371NJ/predict_results/Histoplasma_capsulatum_1371NJ.gff3
 updated: annotate/Histoplasma_capsulatum_1371NJ/update_results/Histoplasma_capsulatum_1371NJ.gff3
[Nov 24 04:41 PM]: Updated annotation complete:
-------------------------------------------------------
Total Gene Models:  3,455
Total transcripts:  3,610
New Gene Models:    11
No Change:      2,568
Update UTRs:        876
Exons Changed:      0
Exons/CDS Changed:  0
Dropped Models:     0
CDS AED:        0.002
mRNA AED:       0.033
-------------------------------------------------------
[Nov 24 04:41 PM]: Funannotate update is finished, output files are in the annotate/Histoplasma_capsulatum_1371NJ/update_results folder
---

The more verbose logfile looks like.

[11/24/23 16:40:45]: /nas/longleaf/home/taniak/.conda/envs/funannotate/bin/funannotate update --cpus 16 -i annotate/Histoplasma_capsulatum_1371NJ --out annotate/Histoplasma_capsulatum_1371NJ --sbt /work/users/t/a/taniak/Annotation/lib/fusarium.sbt --memory 64G

[11/24/23 16:40:45]: OS: Red Hat Enterprise Linux 8.8, 72 cores, ~ 791 GB RAM. Python: 3.7.12
[11/24/23 16:40:45]: Running 1.8.16
[11/24/23 16:40:48]: fasta version=36.3.8g path=/nas/longleaf/home/taniak/.conda/envs/funannotate/bin/fasta
[11/24/23 16:40:48]: minimap2 version=2.26-r1175 path=/nas/longleaf/home/taniak/.conda/envs/funannotate/bin/minimap2
[11/24/23 16:40:48]: tbl2asn version=25.8 path=/nas/longleaf/home/taniak/.conda/envs/funannotate/bin/tbl2asn
[11/24/23 16:40:48]: hisat2 version=2.2.1 path=/nas/longleaf/home/taniak/.conda/envs/funannotate/bin/hisat2
[11/24/23 16:40:48]: hisat2-build version=NA path=/nas/longleaf/home/taniak/.conda/envs/funannotate/bin/hisat2-build
[11/24/23 16:40:48]: kallisto version=0.46.1 path=/nas/longleaf/home/taniak/.conda/envs/funannotate/bin/kallisto
[11/24/23 16:40:48]: Trinity version=2.8.5 path=/nas/longleaf/home/taniak/.conda/envs/funannotate/bin/Trinity
[11/24/23 16:40:48]: bedtools version=bedtools v2.31.0 path=/nas/longleaf/home/taniak/.conda/envs/funannotate/bin/bedtools
[11/24/23 16:40:48]: java version=17.0.3-internal path=/nas/longleaf/home/taniak/.conda/envs/funannotate/bin/java
[11/24/23 16:40:48]: /nas/longleaf/home/taniak/.conda/envs/funannotate/opt/pasa-2.5.3/Launch_PASA_pipeline.pl version=NA path=/nas/longleaf/home/taniak/.conda/envs/funannotate/opt/pasa-2.5.3/Launch_PASA_pipeline.pl
[11/24/23 16:40:48]: /nas/longleaf/home/taniak/.conda/envs/funannotate/opt/pasa-2.5.3/bin/seqclean version=NA path=/nas/longleaf/home/taniak/.conda/envs/funannotate/opt/pasa-2.5.3/bin/seqclean
[11/24/23 16:40:48]: trimmomatic version=0.39 path=/nas/longleaf/home/taniak/.conda/envs/funannotate/bin/trimmomatic
[11/24/23 16:40:48]: minimap2 version=2.26-r1175 path=/nas/longleaf/home/taniak/.conda/envs/funannotate/bin/minimap2
[11/24/23 16:40:48]: blat version=BLAT v37x1 path=/nas/longleaf/home/taniak/.conda/envs/funannotate/bin/blat
[11/24/23 16:40:48]: Found relevant files in annotate/Histoplasma_capsulatum_1371NJ/training, will re-use them:
    GFF3: annotate/Histoplasma_capsulatum_1371NJ/predict_results/Histoplasma_capsulatum_1371NJ.gff3
    Genome: annotate/Histoplasma_capsulatum_1371NJ/predict_results/Histoplasma_capsulatum_1371NJ.scaffolds.fa
    Forward reads: annotate/Histoplasma_capsulatum_1371NJ/training/left.fq.gz
    Reverse reads: annotate/Histoplasma_capsulatum_1371NJ/training/right.fq.gz
    Forward Q-trimmed reads: annotate/Histoplasma_capsulatum_1371NJ/training/trimmomatic/trimmed_left.fastq.gz
    Reverse Q-trimmed reads: annotate/Histoplasma_capsulatum_1371NJ/training/trimmomatic/trimmed_right.fastq.gz
    Forward normalized reads: annotate/Histoplasma_capsulatum_1371NJ/training/normalize/left.norm.fq
    Reverse normalized reads: annotate/Histoplasma_capsulatum_1371NJ/training/normalize/right.norm.fq
    Trinity results: annotate/Histoplasma_capsulatum_1371NJ/training/funannotate_train.trinity-GG.fasta
    PASA config file: annotate/Histoplasma_capsulatum_1371NJ/training/pasa/alignAssembly.txt
    BAM alignments: annotate/Histoplasma_capsulatum_1371NJ/training/funannotate_train.coordSorted.bam
    StringTie GTF: annotate/Histoplasma_capsulatum_1371NJ/training/funannotate_train.stringtie.gtf
[11/24/23 16:40:51]: Reannotating Histoplasma_capsulatum, NCBI accession: None
[11/24/23 16:40:51]: Previous annotation consists of: 3,425 protein coding gene models and 24 non-coding gene models
[11/24/23 16:40:51]: Existing annotation: locustag=1371NJ_ genenumber=3449
[11/24/23 16:40:51]: Input reads: ('annotate/Histoplasma_capsulatum_1371NJ/training/left.fq.gz', 'annotate/Histoplasma_capsulatum_1371NJ/training/right.fq.gz', None)
[11/24/23 16:40:51]: Quality trimmed reads: ('annotate/Histoplasma_capsulatum_1371NJ/training/trimmomatic/trimmed_left.fastq.gz', 'annotate/Histoplasma_capsulatum_1371NJ/training/trimmomatic/trimmed_right.fastq.gz', None)
[11/24/23 16:40:51]: FASTQ headers seem compatible with Trinity
[11/24/23 16:40:51]: Normalized reads: ('annotate/Histoplasma_capsulatum_1371NJ/training/normalize/left.norm.fq', 'annotate/Histoplasma_capsulatum_1371NJ/training/normalize/right.norm.fq', None)
[11/24/23 16:40:51]: Long reads: (None, None, None)
[11/24/23 16:40:51]: Long reads FASTA format: (None, None, None)
[11/24/23 16:40:51]: Long SeqCleaned reads: (None, None, None)
[11/24/23 16:40:51]: Existing BAM alignments found: annotate/Histoplasma_capsulatum_1371NJ/update_misc/trinity.alignments.bam, annotate/Histoplasma_capsulatum_1371NJ/update_misc/transcript.alignments.bam
[11/24/23 16:40:51]: Skipping PASA, found existing output: annotate/Histoplasma_capsulatum_1371NJ/update_misc/pasa_final.gff3
[11/24/23 16:40:51]: Existing Kallisto output found: annotate/Histoplasma_capsulatum_1371NJ/update_misc/kallisto.tsv
[11/24/23 16:40:51]: Parsing Kallisto results. Keeping alt-splicing transcripts if expressed at least 10.0% of highest transcript per locus.
[11/24/23 16:40:51]: Wrote 3,587 transcripts derived from 3,433 protein coding loci.
[11/24/23 16:40:51]: bedtools intersect -sorted -v -a /work/users/t/a/taniak/Annotation/annotate/Histoplasma_capsulatum_1371NJ/update_misc/genome.trna.gff3.sorted.gff3 -b /work/users/t/a/taniak/Annotation/annotate/Histoplasma_capsulatum_1371NJ/update_misc/bestmodels.gff3.sorted.gff3
[11/24/23 16:40:54]: Validating gene models (renaming, checking translations, filtering, etc)
[11/24/23 16:40:54]: Writing 3,455 loci to TBL format: dropped 0 overlapping, 0 too short, and 0 frameshift gene models
[11/24/23 16:40:54]: Converting to Genbank format
[11/24/23 16:40:54]: /nas/longleaf/home/taniak/.conda/envs/funannotate/bin/python /nas/longleaf/home/taniak/.conda/envs/funannotate/lib/python3.7/site-packages/funannotate/aux_scripts/tbl2asn_parallel.py -i annotate/Histoplasma_capsulatum_1371NJ/update_misc/tbl2asn/genome.tbl -f annotate/Histoplasma_capsulatum_1371NJ/update_misc/genome.fa -o annotate/Histoplasma_capsulatum_1371NJ/update_misc/tbl2asn --sbt /work/users/t/a/taniak/Annotation/lib/fusarium.sbt -d annotate/Histoplasma_capsulatum_1371NJ/update_results/Histoplasma_capsulatum_1371NJ.discrepency.report.txt -s Histoplasma_capsulatum -t -l paired-ends -v 1 -c 16 --strain 1371NJ
[11/24/23 16:40:57]: Collecting final annotation files
[11/24/23 16:41:01]: Comparing original annotation to updated
 original: annotate/Histoplasma_capsulatum_1371NJ/predict_results/Histoplasma_capsulatum_1371NJ.gff3
 updated: annotate/Histoplasma_capsulatum_1371NJ/update_results/Histoplasma_capsulatum_1371NJ.gff3
[11/24/23 16:41:44]: Updated annotation complete:
-------------------------------------------------------
Total Gene Models:  3,455
Total transcripts:  3,610
New Gene Models:    11
No Change:      2,568
Update UTRs:        876
Exons Changed:      0
Exons/CDS Changed:  0
Dropped Models:     0
CDS AED:        0.002
mRNA AED:       0.033
-------------------------------------------------------
[11/24/23 16:41:44]: Funannotate update is finished, output files are in the annotate/Histoplasma_capsulatum_1371NJ/update_results folder
---

Any suggestions of where else to look, or if I am missing something, please let me know.

Thank you for your time.

nextgenusfs commented 6 months ago

It seems that tbl2asn failed in the update step (which creates the genbank file from the tbl annotation file). From what you have here, I can't see why it would have failed, especially if it worked in the predict step as you suggested.

When you passed this to annotate, -i $OUTDIR/$BASE, is that this folder (annotate/Histoplasma_capsulatum_1371NJ)?

The script is indeed dying because there is no genbank file, although perhaps this shouldn't be a hard stop. The scripts could generate the necessary output in a different way (and maybe it should).

Ideally you'd like to figure out why tbl2asn failed, you could run that step manually and see if you get an error. Alternative work around is just to run annotate with the FASTA + GFF3 files from update and provide a new output directory, this will regenerate the files it needs and will not rely on the genbank input.

tania-k commented 6 months ago

Hi John, Thanks for taking a look. I noticed in the log file for update that I had the full script to run tbl2asn. I took that and ran it interactively and received a couple of errors.

I think what happened in between predict (where this script worked and created a gbk file), to the update step was an update to python. I was receiving some errors that looked like python errors, and online suggestions were to upgrade or downgrade my python. Which allowed me to progress with the update step but with some key issues.

I think what I am seeing here by adding the quotation marks for "Histoplasma capsulatum" and -t "-l paired-ends" is probably just a syntax issue as I was adjusting the python version. Either way, figured it out and will be re-running my analysis.

Thank you!!

(funannotate) [taniak@longleaf-login5 update_results]$ funannotate util tbl2gbk -i Histoplasma_capsulatum_1371NJ.tbl -f Histoplasma_capsulatum_1371NJ.scaffolds.fa -s Histoplasma capsulatum --strain 1371NJ --tbl2asn -l paired-ends -o Histoplasma_capsulatum_1371NJ
usage: gbk2parts.py [-h] -i TBL -f FASTA -s SPECIES [--isolate ISOLATE] [--strain STRAIN] [-t TBL2ASN] [--sbt SBT] [-o OUTPUT]
gbk2parts.py: error: argument -t/--tbl2asn: expected one argument
(funannotate) [taniak@longleaf-login5 update_results]$ funannotate util tbl2gbk -i Histoplasma_capsulatum_1371NJ.tbl -f Histoplasma_capsulatum_1371NJ.scaffolds.fa -s Histoplasma capsulatum --strain 1371NJ --tbl2asn "-l paired-ends" -o Histoplasma_capsulatum_1371NJ
usage: gbk2parts.py [-h] -i TBL -f FASTA -s SPECIES [--isolate ISOLATE] [--strain STRAIN] [-t TBL2ASN] [--sbt SBT] [-o OUTPUT]
gbk2parts.py: error: unrecognized arguments: capsulatum
(funannotate) [taniak@longleaf-login5 update_results]$ funannotate util tbl2gbk -i Histoplasma_capsulatum_1371NJ.tbl -f Histoplasma_capsulatum_1371NJ.scaffolds.fa -s "Histoplasma capsulatum" --strain 1371NJ --tbl2asn "-l paired-ends" -o Histoplasma_capsulatum_1371NJ
There are 2 gene models that need to be fixed.
-------------------------------------------------------
1371NJ_002402   2 internal stops. Genetic code [1]
(funannotate) [taniak@longleaf-login5 update_results]$ ll
total 86860
-rw-r--r-- 1 taniak rc_matutelb_psx  3230200 Nov 24 16:41 Histoplasma_capsulatum_1371NJ.cds-transcripts.fa
-rw-r--r-- 1 taniak rc_matutelb_psx        0 Nov 24 16:40 Histoplasma_capsulatum_1371NJ.discrepency.report.txt
-rw-r--r-- 1 taniak rc_matutelb_psx  1078529 Dec 12 11:30 Histoplasma_capsulatum_1371NJ.discrepency.txt
-rw-r--r-- 1 taniak rc_matutelb_psx        0 Nov 24 16:40 Histoplasma_capsulatum_1371NJ.error.summary.txt
-rw-r--r-- 1 taniak rc_matutelb_psx 24470243 Dec 12 11:30 Histoplasma_capsulatum_1371NJ.gbk
-rw-r--r-- 1 taniak rc_matutelb_psx  2241607 Nov 24 16:40 Histoplasma_capsulatum_1371NJ.gff3
-rw-r--r-- 1 taniak rc_matutelb_psx  3869464 Nov 24 16:41 Histoplasma_capsulatum_1371NJ.mrna-transcripts.fa
-rw-r--r-- 1 taniak rc_matutelb_psx   372076 Nov 24 16:41 Histoplasma_capsulatum_1371NJ.pasa-reannotation.changes.txt
-rw-r--r-- 1 taniak rc_matutelb_psx  1151423 Nov 24 16:41 Histoplasma_capsulatum_1371NJ.proteins.fa
-rw-r--r-- 1 taniak rc_matutelb_psx 15305820 Nov 24 16:41 Histoplasma_capsulatum_1371NJ.scaffolds.fa
-rw-r--r-- 1 taniak rc_matutelb_psx 35851321 Dec 12 11:30 Histoplasma_capsulatum_1371NJ.sqn
-rw-r--r-- 1 taniak rc_matutelb_psx     1614 Nov 24 16:41 Histoplasma_capsulatum_1371NJ.stats.json
-rw-r--r-- 1 taniak rc_matutelb_psx  1368571 Nov 24 16:40 Histoplasma_capsulatum_1371NJ.tbl
drwxr-sr-x 2 taniak rc_matutelb_psx     4096 Dec 12 11:30 Histoplasma_capsulatum_1371NJ_tmp
-rw-r--r-- 1 taniak rc_matutelb_psx        0 Nov 24 16:40 Histoplasma_capsulatum_1371NJ.validation.txt
-rw-r--r-- 1 taniak rc_matutelb_psx        5 Nov 24 16:41 WGS_accession.txt
hyphaltip commented 4 months ago

I just manually delete genes with these issues usually. did you work around?