Closed: bpeacock44 closed this issue 3 months ago
Hi @bpeacock44. I don't think the docker image supports MySQL-backed PASA; there are permissions issues, I believe. The default is SQLite, which is single-threaded and very slow, but if you are using the docker image I think that is the only method that can be expected to work.
Can you post the contents of your PASA config file, i.e. /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/pasa/alignAssembly.txt?
Thank you for the quick response! That is interesting: when I ran train and specified `--pasa_db mysql`, it said it completed successfully. When I ran it without that option, I got a CMD ERROR similar to the one I pasted above.
Anyways, here is the config file:
## templated variables to be replaced exist as <__var_name__>
# database settings
DATABASE=Hyalorbilia_cf__ulicicola_ARF_L_pasa
#######################################################
# Parameters to specify to specific scripts in pipeline
# create a key = "script_name" + ":" + "parameter"
# assign a value as done above.
#script validate_alignments_in_db.dbi
validate_alignments_in_db.dbi:--NUM_BP_PERFECT_SPLICE_BOUNDARY=3
validate_alignments_in_db.dbi:--MIN_PERCENT_ALIGNED=90
validate_alignments_in_db.dbi:--MIN_AVG_PER_ID=95
#script subcluster_builder.dbi
subcluster_builder.dbi:-m=50
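For reference, the config format above is plain `KEY=VALUE` lines, where (as the file's own comments note) a key may embed a script name and parameter joined by `:`. A minimal sketch of reading it (this parser is hypothetical, not part of PASA):

```python
# Hypothetical sketch: parse the alignAssembly.txt format shown above.
# Lines starting with '#' are comments; entries are KEY=VALUE, where KEY
# may be "script_name:--parameter" per the file's own conventions.
sample = """\
# database settings
DATABASE=Hyalorbilia_cf__ulicicola_ARF_L_pasa
validate_alignments_in_db.dbi:--MIN_PERCENT_ALIGNED=90
subcluster_builder.dbi:-m=50
"""

config = {}
for line in sample.splitlines():
    line = line.strip()
    if not line or line.startswith("#"):
        continue
    # split at the FIRST '=', since values never contain one here
    key, _, value = line.partition("=")
    config[key] = value

print(config["DATABASE"])  # Hyalorbilia_cf__ulicicola_ARF_L_pasa
```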
So I think the issue is that MySQL databases are stored in the OS file system (not in the working directory), so when you are using the docker image it would potentially work inside the container, i.e. saving the data in the container. But unless you did something special to save that container-specific state, when you launch the update script it starts a new container from the base image, which then doesn't have the MySQL/PASA database. SQLite works differently: the database is stored on the file system in the working directory, so the new container can use that data.
I'm trying to think if there is a way to short-circuit that result and have it recalculate the PASA database. I don't recall exactly where the checkpoints are in the script, but it might work to delete the existing PASA results; update would then need to rebuild the PASA database to run the annotation-comparison method. One way to try this might be to duplicate your entire output folder so you have a backup of what was run up to this point, and then in one of those two folders delete the PASA files and see if funannotate will re-run that step from scratch.
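The SQLite side of that difference is easy to demonstrate: the database is just an ordinary file created in whichever directory you point it at, so anything in a bind-mounted working directory survives a container restart. A minimal illustration using plain Python's sqlite3 module (not funannotate itself; the table name is made up):

```python
import os
import sqlite3
import tempfile

# Stand-in for the bind-mounted working directory a container would see.
workdir = tempfile.mkdtemp()

# SQLite stores the whole database as a single ordinary file in that directory...
db_path = os.path.join(workdir, "pasa.sqlite")
con = sqlite3.connect(db_path)
con.execute("CREATE TABLE assemblies (id INTEGER PRIMARY KEY, name TEXT)")
con.commit()
con.close()

# ...so a later process (or a fresh container mounting the same directory)
# can simply reopen the file, unlike MySQL data living inside the old container.
print(os.path.exists(db_path))  # True
```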
Got it - that makes sense. So I should go back to the train command and troubleshoot that CMD ERROR I was getting before? Here is the result of that run:
funannotate-docker train -i "${WDIR}/genomes/03_${G}_scaffolds.fasta" -o "${WDIR}/genomes/04_${G}_fun" \
--left "${WDIR}/rna_data/${G}_S1_R1_001.fastq.gz" \
--right "${WDIR}/rna_data/${G}_S1_R2_001.fastq.gz" \
--single "${WDIR}/rna_data/${G}_unp_R1_001.fastq.gz" \
--jaccard_clip --species "Hyalorbilia cf. ulicicola" \
--strain ARF-L --cpus 19
#-------------------------------------------------------
#[Jan 08 04:50 PM]: OS: Debian GNU/Linux 10, 20 cores, ~ 132 GB RAM. Python: 3.8.12
#[Jan 08 04:50 PM]: Running 1.8.16
#[Jan 08 04:50 PM]: Combining PE and SE reads supported, but you will lose stranded information, setting --stranded no
#[Jan 08 04:50 PM]: Adapter and Quality trimming PE reads with Trimmomatic
#[Jan 08 05:03 PM]: Adapter and Quality trimming SE reads with Trimmomatic
#[Jan 08 05:05 PM]: Running read normalization with Trinity
#[Jan 08 06:19 PM]: Building Hisat2 genome index
#[Jan 08 06:20 PM]: Aligning reads to genome using Hisat2
#[Jan 08 06:20 PM]: Running genome-guided Trinity, logfile: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/Trinity-gg.log
#[Jan 08 06:20 PM]: Clustering of reads from BAM and preparing assembly commands
#[Jan 08 06:26 PM]: Assembling 23,765 Trinity clusters using 18 CPUs Progress: 23765 complete, 0 failed, 0 remaining
#[Jan 08 07:39 PM]: 23,398 transcripts derived from Trinity
#[Jan 08 07:39 PM]: Running StringTie on Hisat2 coordsorted BAM
#[Jan 08 07:39 PM]: Removing poly-A sequences from trinity transcripts using seqclean
#[Jan 08 07:39 PM]: Converting transcript alignments to GFF3 format
#[Jan 08 07:39 PM]: Converting Trinity transcript alignments to GFF3 format
#[Jan 08 07:39 PM]: Running PASA alignment step using 23,397 transcripts
#CMD ERROR: /venv/opt/pasa-2.4.1/Launch_PASA_pipeline.pl -c /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/pasa/alignAssembly.txt -r -C -R -g /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/genome.fasta --IMPORT_CUSTOM_ALIGNMENTS /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/trinity.alignments.gff3 -T -t /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/trinity.fasta.clean -u /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/trinity.fasta --stringent_alignment_overlap 30.0 --TRANSDECODER --ALT_SPLICE --MAX_INTRON_LENGTH 3000 --CPU 19 --ALIGNERS blat --trans_gtf /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/funannotate_train.stringtie.gtf
Whoops sorry responded before I saw your last comment. One moment and I'll read it through.
Happy to give it a try - which files should I delete specifically? I have the following folders and I see PASA files in a few of them.
logfiles predict_misc predict_results training update_misc update_results
Side question (but hopefully easier): it looks like you are at Riverside? Any reason you aren't running this on the HPC cluster instead of docker? Jason (@hyphaltip) should be able to help.
> Happy to give it a try - which files should I delete specifically? I have the following folders and I see PASA files in a few of them.
> logfiles predict_misc predict_results training update_misc update_results
It would be the training/pasa directory. The scripts might detect that there is missing data and re-run it. I don't think I've ever tried this, so no guarantees.
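A sketch of that suggestion as shell commands (all paths here are illustrative stand-ins, not the real output folder):

```shell
# Toy output folder standing in for something like 04_ARF-L_fun
mkdir -p /tmp/funannotate_demo/training/pasa
touch /tmp/funannotate_demo/training/pasa/alignAssembly.txt

# 1. duplicate the entire output folder as a backup
cp -r /tmp/funannotate_demo /tmp/funannotate_demo.bak

# 2. delete the PASA directory in one copy, then re-run funannotate update
#    on that copy and see if it rebuilds the step from scratch
rm -rf /tmp/funannotate_demo/training/pasa

ls /tmp/funannotate_demo.bak/training   # backup still contains pasa
```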
I usually just use HPC if I need more resources than my current computer has. I selected mysql primarily because the person working on this project before me did and I was just referencing their commands while trying to re-run the process. Assuming sqlite takes much more time I will probably go a different route! I'll give it a try and see what happens. Thanks again!
I built the docker image because folks have requested it, but I don't ever use it myself because of some limitations: there are so many dependencies, and the large size of the databases makes it difficult to build into a docker image properly. I know @hyphaltip has a functioning system on the HPC, so you could probably save some time just running it there.
Got it. I just ran the update after deleting the PASA file and got the same CMD ERROR as before, rather than the second error I got when specifying the mysql database. I can see about using HPC if this will be more trouble than it's worth to figure out.
$ funannotate-docker update -i ${WDIR}/genomes/04_ARF-L_fun \
--pasa_db mysql --cpus 19
-------------------------------------------------------
[Jan 10 09:47 PM]: OS: Debian GNU/Linux 10, 20 cores, ~ 132 GB RAM. Python: 3.8.12
[Jan 10 09:47 PM]: Running 1.8.16
[Jan 10 09:47 PM]: No NCBI SBT file given, will use default, for NCBI submissions pass one here '--sbt'
[Jan 10 09:47 PM]: Found relevant files in /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training, will re-use them:
GFF3: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/predict_results/Hyalorbilia_cf._ulicicola_ARF-L.gff3
Genome: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/predict_results/Hyalorbilia_cf._ulicicola_ARF-L.scaffolds.fa
Single reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/single.fq.gz
Forward reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/left.fq.gz
Reverse reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/right.fq.gz
Forward Q-trimmed reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/trimmomatic/trimmed_left.fastq.gz
Reverse Q-trimmed reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/trimmomatic/trimmed_right.fastq.gz
Single Q-trimmed reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/trimmomatic/trimmed_single.fastq.gz
Forward normalized reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/normalize/left.norm.fq
Reverse normalized reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/normalize/right.norm.fq
Single normalized reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/normalize/single.norm.fq
Trinity results: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/funannotate_train.trinity-GG.fasta
BAM alignments: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/funannotate_train.coordSorted.bam
StringTie GTF: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/funannotate_train.stringtie.gtf
[Jan 10 09:47 PM]: Reannotating Hyalorbilia cf. ulicicola, NCBI accession: None
[Jan 10 09:47 PM]: Previous annotation consists of: 16,589 protein coding gene models and 85 non-coding gene models
[Jan 10 09:47 PM]: Existing annotation: locustag=FUN_ genenumber=16674
[Jan 10 09:47 PM]: Existing BAM alignments found: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/trinity.alignments.bam, /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/transcript.alignments.bam
[Jan 10 09:47 PM]: Running PASA alignment step using 23,397 transcripts
[Jan 10 09:47 PM]: CMD ERROR: /venv/opt/pasa-2.4.1/Launch_PASA_pipeline.pl -c /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/pasa/alignAssembly.txt -r -C -R -g /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/genome.fa --IMPORT_CUSTOM_ALIGNMENTS /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/trinity.alignments.gff3 -T -t /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/trinity.fasta.clean -u /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/trinity.fasta --stringent_alignment_overlap 30.0 --TRANSDECODER --MAX_INTRON_LENGTH 3000 --CPU 19 --ALIGNERS blat --trans_gtf /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/funannotate_train.stringtie.gtf
If there is not an obvious error in the update_misc/pasa/pasa.log file, I think you could run that command directly to hopefully see the error. You'd need to use docker run and pass the appropriate flags, i.e. like here: https://raw.githubusercontent.com/nextgenusfs/funannotate/master/funannotate-docker
But then your command would be the very long PASA cmd:
/venv/opt/pasa-2.4.1/Launch_PASA_pipeline.pl -c /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/pasa/alignAssembly.txt -r -C -R -g /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/genome.fa --IMPORT_CUSTOM_ALIGNMENTS /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/trinity.alignments.gff3 -T -t /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/trinity.fasta.clean -u /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/trinity.fasta --stringent_alignment_overlap 30.0 --TRANSDECODER --MAX_INTRON_LENGTH 3000 --CPU 19 --ALIGNERS blat --trans_gtf /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/funannotate_train.stringtie.gtf
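Before going through docker run, a quick scan of the log for error lines may already surface the failing sub-command. A sketch of that check (the demo log content below is invented; the real file is update_misc/pasa/pasa.log):

```shell
# Fabricated stand-in for update_misc/pasa/pasa.log
cat > /tmp/pasa_demo.log <<'EOF'
CMD: blat genome.fa trinity.fasta.clean output.psl
Error, cmd: blat died with ret 256
resuming pipeline
EOF

# Pull out likely failure lines; on the real log, replace the /tmp path
grep -iE 'error|fatal|died' /tmp/pasa_demo.log
```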
I'll see if I can get it working on the HPC cluster and revisit this if it becomes the best option again. Thank you!
My 5c for docker: it is great and worth supporting, because this way you can easily run it on HPC (Singularity in my case) without a laborious installation.
I am trying to run update following the tutorial, but I ran into a CMD ERROR, as seen below, which didn't explicitly state what the issue was:
Previously I'd run into issues and ended up using the MySQL database when I ran the train command, and it completed successfully. So I added the `--pasa_db mysql` option here as well, hoping it would handle the error, but I got a different one: