Open DaRinker opened 1 year ago
UPDATE: I re-ran my script (using whatever pre-existing output funannotate could find) and everything seemed to work just fine (??).
Starting singularity...
Funannotate-prepped genome file found..
##########
Running command:
funannotate train -i DTO4.secondpolish.funsorted.masked.fasta -o funannotate.out -l /path/to/my/illumina.rna.reads/treatment1_R1.fastq.gz -r /path/to/my/illumina.rna.reads/treatment1_R2.fastq.gz --cpus 20
#########
[Mar 15 07:53 AM]: OS: Debian GNU/Linux 10, 256 cores, ~ 792 GB RAM. Python: 3.8.12
[Mar 15 07:53 AM]: Running 1.8.14
[Mar 15 07:53 AM]: 24,150 existing Trinity results found: funannotate.out/training/trinity.fasta
[Mar 15 07:53 AM]: Removing poly-A sequences from trinity transcripts using seqclean
[Mar 15 07:53 AM]: Existing SeqClean output found: funannotate.out/training/funannotate.out/training/trinity.fasta.clean
[Mar 15 07:53 AM]: Existing BAM alignments found: funannotate.out/training/trinity.alignments.bam, funannotate.out/training/transcript.alignments.bam
[Mar 15 07:53 AM]: Existing PASA assemblies found: funannotate.out/training/pasa/DTO4_secondpolish_funsorted_masked_pasa.assemblies.fasta
[Mar 15 07:53 AM]: PASA assigned 24,256 transcripts to 16,567 loci (genes)
[Mar 15 07:53 AM]: Getting PASA models for training with TransDecoder
[Mar 15 08:03 AM]: PASA finished. PASAweb accessible via: localhost:port/cgi-bin/index.cgi?db=/path/to/my/assembly/fasta/funannotate.out/training/pasa/DTO4_secondpolish_funsorted_masked_pasa
[Mar 15 08:03 AM]: Using Kallisto TPM data to determine which PASA gene models to select at each locus
[Mar 15 08:03 AM]: Building Kallisto index
[Mar 15 08:04 AM]: Mapping reads using pseudoalignment in Kallisto
[Mar 15 08:10 AM]: Parsing expression value results. Keeping best transcript at each locus.
[Mar 15 08:11 AM]: Wrote 8,927 PASA gene models
[Mar 15 08:11 AM]: PASA database name: DTO4.secondpolish.funsorted.masked
[Mar 15 08:11 AM]: Trinity/PASA has completed, you are now ready to run funanotate predict, for example:
funannotate predict -i DTO4.secondpolish.funsorted.masked.fasta \
-o funannotate.out -s "DTO4.secondpolish.funsorted.masked" --cpus 20
-------------------------------------------------------
-------------------------------------------------------
##########
Running command:
funannotate predict -i DTO4.secondpolish.funsorted.masked.fasta --species "my_species" --isolate DTO4 --transcript_evidence funannotate.out/training/funannotate_train.trinity-GG.fasta --rna_bam funannotate.out/training/funannotate_train.coordSorted.bam --pasa_gff funannotate.out/training/funannotate_train.pasa.gff3 --out funannotate.out
#########
-------------------------------------------------------
[Mar 15 08:11 AM]: OS: Debian GNU/Linux 10, 256 cores, ~ 792 GB RAM. Python: 3.8.12
[Mar 15 08:11 AM]: Running funannotate v1.8.14
[Mar 15 08:11 AM]: GeneMark not found and $GENEMARK_PATH environmental variable missing. Will skip GeneMark ab-initio prediction.
[Mar 15 08:11 AM]: Found training files, will re-use these files:
--stringtie funannotate.out/training/funannotate_train.stringtie.gtf
--transcript_alignments funannotate.out/training/funannotate_train.transcripts.gff3
[Mar 15 08:11 AM]: Parsed training data, run ab-initio gene predictors as follows:
Program Training-Method
augustus pasa
codingquarry rna-bam
glimmerhmm pasa
snap pasa
[Mar 15 08:11 AM]: Loading genome assembly and parsing soft-masked repetitive sequences
[Mar 15 08:11 AM]: Genome loaded: 24 scaffolds; 31,699,275 bp; 2.56% repeats masked
[Mar 15 08:11 AM]: Parsed 16,285 transcript alignments from: funannotate.out/training/funannotate_train.transcripts.gff3
[Mar 15 08:12 AM]: Aligning 7,865 unique transcripts [not found in exising alignments] with minimap2
[Mar 15 08:12 AM]: Mapped 0 of these transcripts to the genome
[Mar 15 08:12 AM]: Creating transcript EVM alignments and Augustus transcripts hintsfile
[Mar 15 08:12 AM]: Extracting hints from RNA-seq BAM file using bam2hints
[Mar 15 08:12 AM]: Mapping 555,918 proteins to genome using diamond and exonerate
Could the problem be with my shell script? This is how I have it set up:
######################################
cat <<EOF
##########
Running command:
funannotate train -i ${genomefasta%.fa*}.funsorted.masked.fasta \
-o funannotate.out \
-l ${rnaseqreadspath}/${fwdrnaseq} \
-r ${rnaseqreadspath}/${revrnaseq} \
--cpus 20
#########
EOF
funannotate train -i ${genomefasta%.fa*}.funsorted.masked.fasta \
-o funannotate.out \
-l ${rnaseqreadspath}/${fwdrnaseq} \
-r ${rnaseqreadspath}/${revrnaseq} \
--cpus 20
if [ $? -eq 0 ]
then
echo "Successfully completed 'funannotate train' on RNAseq data" >&2
else
echo "Could not complete 'funannotate train'. Exiting..." >&2
exit 1
fi
cat <<EOF
##########
Running command:
funannotate predict -i ${genomefasta%.fa*}.funsorted.masked.fasta \
--species "aspergillus_fischeri" --isolate ${sample} \
--transcript_evidence funannotate.out/training/funannotate_train.trinity-GG.fasta \
--rna_bam funannotate.out/training/funannotate_train.coordSorted.bam \
--pasa_gff funannotate.out/training/funannotate_train.pasa.gff3 \
--out funannotate.out
#########
EOF
funannotate predict -i ${genomefasta%.fa*}.funsorted.masked.fasta \
--species "aspergillus_fischeri" --isolate ${sample} \
--transcript_evidence funannotate.out/training/funannotate_train.trinity-GG.fasta \
--rna_bam funannotate.out/training/funannotate_train.coordSorted.bam \
--pasa_gff funannotate.out/training/funannotate_train.pasa.gff3 \
--out funannotate.out
You are running the SQLite version to generate the db. It might be that the path loaded is different when you run interactively vs. directly?
It might be good to test again with a clean folder if you want to make sure this won't come back on your next annotation.
might be that the path loaded is diff when you run interactively vs direct?
Interesting. I would have guessed that the same version would be used since both are run inside the container. Not exactly sure how to remedy.
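One way to test the idea that the loaded path differs between interactive and batch runs is to record the environment in both modes and diff the two files. This is a minimal sketch, assuming bash is available inside the container. The function name snapshot_env and the lists of tools and variables are hypothetical choices for illustration, not part of funannotate.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a snapshot of the runtime environment to a file
# so that an interactive run and a batch run can be compared with diff.
snapshot_env() {
    local outfile="$1"
    {
        echo "## PATH entries"
        printf '%s\n' "$PATH" | tr ':' '\n'

        echo "## tool locations"
        # The tool list is an assumption; extend it with whatever PASA needs.
        for tool in funannotate sqlite3 perl python; do
            # command -v prints nothing and fails if the tool is not on PATH.
            printf '%s\t%s\n' "$tool" "$(command -v "$tool" || echo 'not found')"
        done

        echo "## selected variables"
        # grep exits non-zero when nothing matches, hence the "|| true".
        env | grep -E '^(HOME|TMPDIR|PERL5LIB|PYTHONPATH|LD_LIBRARY_PATH)=' | sort || true
    } > "$outfile"
}

# Example (not executed here), run inside the container in each mode:
#   snapshot_env interactive.env    # from an interactive session
#   snapshot_env batch.env          # from the cluster job script
#   diff interactive.env batch.env
```

An empty diff would suggest the environment is not the cause. Singularity passes most host environment variables into the container by default, so a login shell and a batch job can see different settings even when the image is identical.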
On Wed, Mar 15, 2023 at 12:48 PM Jason Stajich @.***> wrote:
you are running the sqlite version to generate the db - might be that the path loaded is diff when you run interactively vs direct?
might be good to test again with a clean folder if you want to try to make sure this won't be back on your next annotation.
As an update, this is still a problem for me. However, I have since discovered that I can also get the annotation to complete simply by resubmitting the same job to the cluster.
For example, this is my failed job:
[Mar 18 09:12 AM]: Running StringTie on Hisat2 coordsorted BAM
[Mar 18 09:12 AM]: Removing poly-A sequences from trinity transcripts using seqclean
[Mar 18 09:13 AM]: Converting transcript alignments to GFF3 format
[Mar 18 09:13 AM]: Converting Trinity transcript alignments to GFF3 format
[Mar 18 09:13 AM]: Running PASA alignment step using 28,314 transcripts
[Mar 18 11:41 PM]: CMD ERROR: /venv/opt/pasa-2.4.1/Launch_PASA_pipeline.pl -c path/to/assembly/DTO1/pilonresults/funannotate.out/training/pasa/alignAssembly.txt -r -C -R -g path/to/assembly/DTO1/pilonresults/funannotate.out/training/genome.fasta --IMPORT_CUSTOM_ALIGNMENTS path/to/assembly/DTO1/pilonresults/funannotate.out/training/trinity.alignments.gff3 -T -t path/to/assembly/DTO1/pilonresults/funannotate.out/training/trinity.fasta.clean -u path/to/assembly/DTO1/pilonresults/funannotate.out/training/trinity.fasta --stringent_alignment_overlap 30.0 --TRANSDECODER --ALT_SPLICE --MAX_INTRON_LENGTH 3000 --CPU 20 --ALIGNERS blat --trans_gtf path/to/assembly/DTO1/pilonresults/funannotate.out/training/funannotate_train.stringtie.gtf
Which I can then relaunch:
[Mar 19 09:17 AM]: OS: Debian GNU/Linux 10, 256 cores, ~ 1056 GB RAM. Python: 3.8.12
[Mar 19 09:17 AM]: Running 1.8.14
[Mar 19 09:17 AM]: 28,314 existing Trinity results found: funannotate.out/training/trinity.fasta
[Mar 19 09:17 AM]: Removing poly-A sequences from trinity transcripts using seqclean
[Mar 19 09:17 AM]: Existing SeqClean output found: funannotate.out/training/funannotate.out/training/trinity.fasta.clean
[Mar 19 09:17 AM]: Existing BAM alignments found: funannotate.out/training/trinity.alignments.bam, funannotate.out/training/transcript.alignments.bam
[Mar 19 09:17 AM]: Existing PASA assemblies found: funannotate.out/training/pasa/DTO1_secondpolish_funsorted_masked_pasa.assemblies.fasta
[Mar 19 09:17 AM]: PASA assigned 25,684 transcripts to 13,715 loci (genes)
[Mar 19 09:17 AM]: Getting PASA models for training with TransDecoder
[Mar 19 09:30 AM]: PASA finished. PASAweb accessible via: localhost:port/cgi-bin/index.cgi?db=path/to/assembly/DTO1/pilonresults/funannotate.out/training/pasa/DTO1_secondpolish_funsorted_masked_pasa
[Mar 19 09:30 AM]: Using Kallisto TPM data to determine which PASA gene models to select at each locus
[Mar 19 09:30 AM]: Building Kallisto index
[Mar 19 09:32 AM]: Mapping reads using pseudoalignment in Kallisto
[Mar 19 09:34 AM]: Parsing expression value results. Keeping best transcript at each locus.
So my workflow now is: 1) submit the job, 2) wait for the job to crash with the PASA error documented above, 3) resubmit the same job script, 4) wait for the job to complete successfully.
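The submit, crash, resubmit cycle could be folded into the job script itself with a small retry wrapper. This is only a sketch of the workaround and does not address why PASA fails on the first pass. The retry function is a hypothetical helper, and the commented example reuses the variables from the script posted earlier in the thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, retrying until it succeeds or
# max_attempts is reached. Returns 1 if every attempt fails.
retry() {
    local max_attempts="$1"
    shift
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Command failed after ${attempt} attempt(s): $*" >&2
            return 1
        fi
        echo "Attempt ${attempt} failed; retrying..." >&2
        attempt=$((attempt + 1))
    done
    echo "Succeeded on attempt ${attempt}" >&2
}

# Example (not executed here): allow one automatic rerun of the training step.
#   retry 2 funannotate train -i "${genomefasta%.fa*}.funsorted.masked.fasta" \
#       -o funannotate.out \
#       -l "${rnaseqreadspath}/${fwdrnaseq}" \
#       -r "${rnaseqreadspath}/${revrnaseq}" \
#       --cpus 20
```

Because the second attempt reuses whatever the first attempt left in funannotate.out, this reproduces the manual resubmission exactly, including any partial PASA output.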
However, I am now concerned that the PASA output being used in the second attempt (25,684 transcripts) doesn't match the input (28,314 transcripts) from the first (failed) attempt. I had been assuming that the number of transcripts "assigned" should be a subset of the starting number, so I was okay with the discrepancy, but now I feel like I might be assuming too much.
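A quick sanity check on the counts is to tally the FASTA records in each intermediate file and compare them with the numbers in the log. This is a minimal sketch. count_fasta_records is a hypothetical helper, and the paths in the commented example are copied from the log lines above. In the DTO4 log earlier in the thread the assigned count (24,256) is larger than the Trinity count (24,150), so the two numbers may not measure the same thing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count sequences in a FASTA file by counting header lines.
count_fasta_records() {
    # grep -c prints 0 but exits non-zero when there are no matches,
    # so "|| true" keeps the function usable under "set -e".
    grep -c '^>' "$1" || true
}

# Example (not executed here), using paths reported in the log:
#   count_fasta_records funannotate.out/training/trinity.fasta
#   count_fasta_records funannotate.out/training/funannotate.out/training/trinity.fasta.clean
#   count_fasta_records funannotate.out/training/pasa/DTO1_secondpolish_funsorted_masked_pasa.assemblies.fasta
```

Running the same counts after a failed first attempt and again after the successful rerun would show whether the rerun is working from a truncated PASA output.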
Are you using the latest release?
Yes, running the Docker image through Singularity.
Describe the bug
I am trying to annotate a large number of genomes on a local cluster, using a Singularity image with a supporting .sh script. The command funannotate train regularly fails when running as a job on the cluster. The error is always:
However, if I run the same command (below) interactively, it all completes as it should. The failures occur across all compute nodes. I'm wondering if other processes on the compute nodes can interfere with funannotate? I'm also wondering if there's some problem with the way I'm specifying the number of cores. Is there something about the PASA step specifically that might consistently be "calling me out" on this?
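On the question of cores: the log reports 256 cores on the node while the script hard-codes --cpus 20, so one thing to rule out is a mismatch between the requested thread count and what the scheduler actually allocated. The sketch below assumes a SLURM cluster, which the thread does not state. pick_cpus is a hypothetical helper, and SLURM_CPUS_PER_TASK is only set when the job requests CPUs per task.

```shell
#!/usr/bin/env bash
# Hypothetical helper: use the scheduler's CPU allocation when it is known,
# otherwise fall back to a caller-supplied default (1 if none is given).
# Assumption: the cluster runs SLURM and exports SLURM_CPUS_PER_TASK.
pick_cpus() {
    local fallback="${1:-1}"
    if [ -n "${SLURM_CPUS_PER_TASK:-}" ]; then
        echo "$SLURM_CPUS_PER_TASK"
    else
        echo "$fallback"
    fi
}

# Example (not executed here):
#   cpus="$(pick_cpus 20)"
#   funannotate train -i "${genomefasta%.fa*}.funsorted.masked.fasta" \
#       -o funannotate.out \
#       -l "${rnaseqreadspath}/${fwdrnaseq}" \
#       -r "${rnaseqreadspath}/${revrnaseq}" \
#       --cpus "$cpus"
```

This only keeps the thread count in step with the allocation. Whether oversubscription is what makes the PASA step fail is not established anywhere in this thread.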
What command did you issue?
Logfiles
OS/Install Information
funannotate check --show-versions