nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

CMD ERROR during update using funannotate-docker #994

Closed bpeacock44 closed 3 months ago

bpeacock44 commented 5 months ago

I am trying to run update following the tutorial, but I ran into the CMD ERROR shown below, which doesn't explicitly say what went wrong:

$ funannotate-docker update -i ${WDIR}/genomes/04_ARF-L_fun --cpus 19
-------------------------------------------------------
[Jan 09 10:11 PM]: OS: Debian GNU/Linux 10, 20 cores, ~ 132 GB RAM. Python: 3.8.12
[Jan 09 10:11 PM]: Running 1.8.16
[Jan 09 10:11 PM]: No NCBI SBT file given, will use default, for NCBI submissions pass one here '--sbt'
[Jan 09 10:11 PM]: Found relevant files in /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training, will re-use them:
        GFF3: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/predict_results/Hyalorbilia_cf._ulicicola_ARF-L.gff3
        Genome: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/predict_results/Hyalorbilia_cf._ulicicola_ARF-L.scaffolds.fa
        Single reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/single.fq.gz
        Forward reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/left.fq.gz
        Reverse reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/right.fq.gz
        Forward Q-trimmed reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/trimmomatic/trimmed_left.fastq.gz
        Reverse Q-trimmed reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/trimmomatic/trimmed_right.fastq.gz
        Single Q-trimmed reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/trimmomatic/trimmed_single.fastq.gz
        Forward normalized reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/normalize/left.norm.fq
        Reverse normalized reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/normalize/right.norm.fq
        Single normalized reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/normalize/single.norm.fq
        Trinity results: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/funannotate_train.trinity-GG.fasta
        PASA config file: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/pasa/alignAssembly.txt
        BAM alignments: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/funannotate_train.coordSorted.bam
        StringTie GTF: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/funannotate_train.stringtie.gtf
[Jan 09 10:11 PM]: Reannotating Hyalorbilia cf. ulicicola, NCBI accession: None
[Jan 09 10:11 PM]: Previous annotation consists of: 16,589 protein coding gene models and 85 non-coding gene models
[Jan 09 10:11 PM]: Existing annotation: locustag=FUN_ genenumber=16674
[Jan 09 10:11 PM]: Converting transcript alignments to GFF3 format
[Jan 09 10:11 PM]: Converting Trinity transcript alignments to GFF3 format
[Jan 09 10:11 PM]: PASA database is SQLite: Hyalorbilia_cf__ulicicola_ARF_L_pasa
[Jan 09 10:11 PM]: Running PASA annotation comparison step 1
[Jan 09 10:11 PM]: CMD ERROR: /venv/opt/pasa-2.4.1/Launch_PASA_pipeline.pl -c /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/pasa/annotCompare.txt -g /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/genome.fa -t /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/trinity.fasta.clean -A -L --CPU 19 --annots /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/genome.gff3
--------------

Previously I'd run into issues and ended up using the mysql database when I ran the train command, and it completed successfully. So I added the --pasa_db option here as well, hoping it would get around the error, but I got a different one:

$ funannotate-docker update -i ${WDIR}/genomes/04_ARF-L_fun \
    --pasa_db mysql --cpus 19
-------------------------------------------------------
[Jan 09 10:17 PM]: OS: Debian GNU/Linux 10, 20 cores, ~ 132 GB RAM. Python: 3.8.12
[Jan 09 10:17 PM]: Running 1.8.16
[Jan 09 10:17 PM]: No NCBI SBT file given, will use default, for NCBI submissions pass one here '--sbt'
[Jan 09 10:17 PM]: Found relevant files in /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training, will re-use them:
        GFF3: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/predict_results/Hyalorbilia_cf._ulicicola_ARF-L.gff3
        Genome: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/predict_results/Hyalorbilia_cf._ulicicola_ARF-L.scaffolds.fa
        Single reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/single.fq.gz
        Forward reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/left.fq.gz
        Reverse reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/right.fq.gz
        Forward Q-trimmed reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/trimmomatic/trimmed_left.fastq.gz
        Reverse Q-trimmed reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/trimmomatic/trimmed_right.fastq.gz
        Single Q-trimmed reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/trimmomatic/trimmed_single.fastq.gz
        Forward normalized reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/normalize/left.norm.fq
        Reverse normalized reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/normalize/right.norm.fq
        Single normalized reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/normalize/single.norm.fq
        Trinity results: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/funannotate_train.trinity-GG.fasta
        PASA config file: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/pasa/alignAssembly.txt
        BAM alignments: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/funannotate_train.coordSorted.bam
        StringTie GTF: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/funannotate_train.stringtie.gtf
[Jan 09 10:17 PM]: Reannotating Hyalorbilia cf. ulicicola, NCBI accession: None
[Jan 09 10:17 PM]: Previous annotation consists of: 16,589 protein coding gene models and 85 non-coding gene models
[Jan 09 10:17 PM]: Existing annotation: locustag=FUN_ genenumber=16674
[Jan 09 10:17 PM]: Existing BAM alignments found: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/trinity.alignments.bam, /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/transcript.alignments.bam
Traceback (most recent call last):
  File "/venv/bin/funannotate", line 8, in <module>
    sys.exit(main())
  File "/venv/lib/python3.8/site-packages/funannotate/funannotate.py", line 717, in main
    mod.main(arguments)
  File "/venv/lib/python3.8/site-packages/funannotate/update.py", line 3278, in main
    runPASA(
  File "/venv/lib/python3.8/site-packages/funannotate/update.py", line 846, in runPASA
    if not getPASAinformation(configFile, DataBaseName, folder, genome):
  File "/venv/lib/python3.8/site-packages/funannotate/update.py", line 741, in getPASAinformation
    with open(pasaconf_file, "r") as pasaconf:
FileNotFoundError: [Errno 2] No such file or directory: '/venv/opt/pasa-2.4.1/pasa_conf/conf.txt'
nextgenusfs commented 5 months ago

Hi @bpeacock44. I don't think the docker image supports a mysql-backed PASA; there are permissions issues, I believe. The default is sqlite, which is single-threaded and very slow, but if you are using the docker image I think that is the only backend that can be expected to work.

Can you post the contents of your PASA config file, ie /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/pasa/alignAssembly.txt?

bpeacock44 commented 5 months ago

Thank you for the quick response! That is interesting - so when I ran train and specified "--pasa_db mysql", it said it completed successfully. When I ran it without that option, I got a CMD ERROR similar to the one I pasted above.

Anyways, here is the config file:

## templated variables to be replaced exist as <__var_name__>

# database settings
DATABASE=Hyalorbilia_cf__ulicicola_ARF_L_pasa

#######################################################
# Parameters to specify to specific scripts in pipeline
# create a key = "script_name" + ":" + "parameter"
# assign a value as done above.

#script validate_alignments_in_db.dbi

validate_alignments_in_db.dbi:--NUM_BP_PERFECT_SPLICE_BOUNDARY=3
validate_alignments_in_db.dbi:--MIN_PERCENT_ALIGNED=90
validate_alignments_in_db.dbi:--MIN_AVG_PER_ID=95

#script subcluster_builder.dbi
subcluster_builder.dbi:-m=50
nextgenusfs commented 5 months ago

So I think the issue is that mysql databases are stored in the OS file system (and not in the working directory), so when you are using the docker image it would potentially work within a single container, i.e. the data gets saved inside that container. But unless you did something special to save that specific container, when you launch the update script it starts a new container from the base image -- which then doesn't have the mysql/pasa database. Sqlite works differently because the database is stored on the file system in the working directory, so the new container can use that data.
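
To make that concrete, each funannotate-docker call is roughly equivalent to something like the following (a sketch only - the image name and mount point are illustrative, not the exact contents of the wrapper script):

# each invocation starts a throwaway container; only the bind-mounted host
# directory outlives it
$ docker run --rm -v $(pwd):/work -w /work nextgenusfs/funannotate \
    funannotate update -i 04_ARF-L_fun --cpus 19

# sqlite: the PASA database is a plain file written under the output folder on
#   the host, so the next container can pick it up again
# mysql:  the data lives in the mysql data directory inside the container's own
#   filesystem, which is discarded with the container, so the next run starts
#   from the base image without that database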

nextgenusfs commented 5 months ago

I'm trying to think if there is a way to short-circuit that result and have it recalculate the PASA database. I don't recall exactly where the checkpoints are in the script, but it might work to delete the existing PASA results -- then update would need to rebuild the PASA database to run the annotation comparison method. One way to try this might be to duplicate your entire output folder so you have a backup of what has been run up to this point, and then in one of those two folders try to delete the PASA files and see if funannotate will re-run that step from scratch.

bpeacock44 commented 5 months ago

Got it - that makes sense. So I should go back to the train command and troubleshoot that CMD ERROR I was getting before? Here is the result of that run:

funannotate-docker train -i "${WDIR}/genomes/03_${G}_scaffolds.fasta" -o "${WDIR}/genomes/04_${G}_fun" \
    --left "${WDIR}/rna_data/${G}_S1_R1_001.fastq.gz" \
    --right "${WDIR}/rna_data/${G}_S1_R2_001.fastq.gz" \
    --single "${WDIR}/rna_data/${G}_unp_R1_001.fastq.gz" \
    --jaccard_clip --species "Hyalorbilia cf. ulicicola" \
    --strain ARF-L --cpus 19
#-------------------------------------------------------
#[Jan 08 04:50 PM]: OS: Debian GNU/Linux 10, 20 cores, ~ 132 GB RAM. Python: 3.8.12
#[Jan 08 04:50 PM]: Running 1.8.16
#[Jan 08 04:50 PM]: Combining PE and SE reads supported, but you will lose stranded information, setting --stranded no
#[Jan 08 04:50 PM]: Adapter and Quality trimming PE reads with Trimmomatic
#[Jan 08 05:03 PM]: Adapter and Quality trimming SE reads with Trimmomatic
#[Jan 08 05:05 PM]: Running read normalization with Trinity
#[Jan 08 06:19 PM]: Building Hisat2 genome index
#[Jan 08 06:20 PM]: Aligning reads to genome using Hisat2
#[Jan 08 06:20 PM]: Running genome-guided Trinity, logfile: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/Trinity-gg.log
#[Jan 08 06:20 PM]: Clustering of reads from BAM and preparing assembly commands
#[Jan 08 06:26 PM]: Assembling 23,765 Trinity clusters using 18 CPUs Progress: 23765 complete, 0 failed, 0 remaining
#[Jan 08 07:39 PM]: 23,398 transcripts derived from Trinity
#[Jan 08 07:39 PM]: Running StringTie on Hisat2 coordsorted BAM
#[Jan 08 07:39 PM]: Removing poly-A sequences from trinity transcripts using seqclean
#[Jan 08 07:39 PM]: Converting transcript alignments to GFF3 format
#[Jan 08 07:39 PM]: Converting Trinity transcript alignments to GFF3 format
#[Jan 08 07:39 PM]: Running PASA alignment step using 23,397 transcripts
#CMD ERROR: /venv/opt/pasa-2.4.1/Launch_PASA_pipeline.pl -c /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/pasa/alignAssembly.txt -r -C -R -g /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/genome.fasta --IMPORT_CUSTOM_ALIGNMENTS /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/trinity.alignments.gff3 -T -t /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/trinity.fasta.clean -u /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/trinity.fasta --stringent_alignment_overlap 30.0 --TRANSDECODER --ALT_SPLICE --MAX_INTRON_LENGTH 3000 --CPU 19 --ALIGNERS blat --trans_gtf /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/funannotate_train.stringtie.gtf
bpeacock44 commented 5 months ago

Whoops, sorry - I responded before I saw your last comment. One moment and I'll read it through.

bpeacock44 commented 5 months ago

Happy to give it a try - which files should I delete specifically? I have the following folders and I see PASA files in a few of them.

logfiles predict_misc predict_results training update_misc update_results

nextgenusfs commented 5 months ago

Side question (but hopefully an easier one): it looks like you are at Riverside? Any reason you aren't running this on the HPC cluster instead of docker? Jason (@hyphaltip) should be able to help.

nextgenusfs commented 5 months ago

> Happy to give it a try - which files should I delete specifically? I have the following folders and I see PASA files in a few of them.
>
> logfiles predict_misc predict_results training update_misc update_results

It would be the training/pasa directory. The scripts might detect that data is missing and re-run it. I don't think I've ever tried this, so no guarantees.
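
Something along these lines (just a sketch; keep the backup around in case the re-run doesn't pick up cleanly):

$ cp -r ${WDIR}/genomes/04_ARF-L_fun ${WDIR}/genomes/04_ARF-L_fun.bak   # full backup first
$ rm -rf ${WDIR}/genomes/04_ARF-L_fun/training/pasa                     # force PASA to be rebuilt
$ rm -rf ${WDIR}/genomes/04_ARF-L_fun/update_misc/pasa                  # if update already created one
$ funannotate-docker update -i ${WDIR}/genomes/04_ARF-L_fun --cpus 19   # default sqlite backend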

bpeacock44 commented 5 months ago

I usually just use HPC if I need more resources than my current computer has. I selected mysql primarily because the person working on this project before me did, and I was just referencing their commands while trying to re-run the process. Assuming sqlite takes much more time, I will probably go a different route! I'll give it a try and see what happens. Thanks again!

nextgenusfs commented 5 months ago

I built the docker image because folks have requested it. But I don't ever use it because of some limitations: there are so many dependencies, and the large size of the databases makes it difficult to build into a docker image properly. I know @hyphaltip has a functioning system on the HPC, so you could probably save some time just running it there.

bpeacock44 commented 5 months ago

Got it. I just ran the update after deleting the PASA file and got the same CMD ERROR as before, rather than the second error I got when specifying the mysql database. I can see about using HPC if this will be more trouble than it's worth to figure out.

$ funannotate-docker update -i ${WDIR}/genomes/04_ARF-L_fun \
    --pasa_db mysql --cpus 19
-------------------------------------------------------
[Jan 10 09:47 PM]: OS: Debian GNU/Linux 10, 20 cores, ~ 132 GB RAM. Python: 3.8.12
[Jan 10 09:47 PM]: Running 1.8.16
[Jan 10 09:47 PM]: No NCBI SBT file given, will use default, for NCBI submissions pass one here '--sbt'
[Jan 10 09:47 PM]: Found relevant files in /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training, will re-use them:
        GFF3: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/predict_results/Hyalorbilia_cf._ulicicola_ARF-L.gff3
        Genome: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/predict_results/Hyalorbilia_cf._ulicicola_ARF-L.scaffolds.fa
        Single reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/single.fq.gz
        Forward reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/left.fq.gz
        Reverse reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/right.fq.gz
        Forward Q-trimmed reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/trimmomatic/trimmed_left.fastq.gz
        Reverse Q-trimmed reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/trimmomatic/trimmed_right.fastq.gz
        Single Q-trimmed reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/trimmomatic/trimmed_single.fastq.gz
        Forward normalized reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/normalize/left.norm.fq
        Reverse normalized reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/normalize/right.norm.fq
        Single normalized reads: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/normalize/single.norm.fq
        Trinity results: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/funannotate_train.trinity-GG.fasta
        BAM alignments: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/funannotate_train.coordSorted.bam
        StringTie GTF: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/funannotate_train.stringtie.gtf
[Jan 10 09:47 PM]: Reannotating Hyalorbilia cf. ulicicola, NCBI accession: None
[Jan 10 09:47 PM]: Previous annotation consists of: 16,589 protein coding gene models and 85 non-coding gene models
[Jan 10 09:47 PM]: Existing annotation: locustag=FUN_ genenumber=16674
[Jan 10 09:47 PM]: Existing BAM alignments found: /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/trinity.alignments.bam, /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/transcript.alignments.bam
[Jan 10 09:47 PM]: Running PASA alignment step using 23,397 transcripts
[Jan 10 09:47 PM]: CMD ERROR: /venv/opt/pasa-2.4.1/Launch_PASA_pipeline.pl -c /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/pasa/alignAssembly.txt -r -C -R -g /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/genome.fa --IMPORT_CUSTOM_ALIGNMENTS /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/trinity.alignments.gff3 -T -t /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/trinity.fasta.clean -u /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/trinity.fasta --stringent_alignment_overlap 30.0 --TRANSDECODER --MAX_INTRON_LENGTH 3000 --CPU 19 --ALIGNERS blat --trans_gtf /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/funannotate_train.stringtie.gtf
nextgenusfs commented 5 months ago

If there is not an obvious error in the update_misc/pasa/pasa.log file, I think you could run that command directly to hopefully see the error. You'd need to use docker run and pass the appropriate flags, i.e. like the wrapper does here: https://raw.githubusercontent.com/nextgenusfs/funannotate/master/funannotate-docker

But then your command would be the very long PASA cmd:

/venv/opt/pasa-2.4.1/Launch_PASA_pipeline.pl -c /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/pasa/alignAssembly.txt -r -C -R -g /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/genome.fa --IMPORT_CUSTOM_ALIGNMENTS /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/trinity.alignments.gff3 -T -t /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/trinity.fasta.clean -u /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/trinity.fasta --stringent_alignment_overlap 30.0 --TRANSDECODER --MAX_INTRON_LENGTH 3000 --CPU 19 --ALIGNERS blat --trans_gtf /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/training/funannotate_train.stringtie.gtf
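
For example, something along these lines (the image name and bind mount are illustrative; the real flags are in the wrapper script linked above):

$ docker run --rm -it \
    -v /mnt/data1/PN97_fungal_genomes:/mnt/data1/PN97_fungal_genomes \
    -w /mnt/data1/PN97_fungal_genomes \
    nextgenusfs/funannotate \
    /venv/opt/pasa-2.4.1/Launch_PASA_pipeline.pl -c /mnt/data1/PN97_fungal_genomes/genomes/04_ARF-L_fun/update_misc/pasa/alignAssembly.txt ...   # paste the full PASA command from above

That way PASA's own error output should show up directly in your terminal.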
bpeacock44 commented 5 months ago

I'll see if I can get it working on the HPC cluster and revisit this if it becomes the best option again. Thank you!

MichaelFokinNZ commented 5 months ago

My 5c for docker - it is great and worth supporting, because this way you can easily run it on HPC (Singularity in my case) without a laborious installation.
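
For example (a minimal sketch; the image tag is illustrative):

$ singularity pull docker://nextgenusfs/funannotate:latest     # converts the docker image to funannotate_latest.sif
$ singularity exec funannotate_latest.sif \
    funannotate update -i 04_ARF-L_fun --cpus 19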