WorkflowError - missing file

bchesnut commented 5 days ago

I am getting the following error while running the test script for CADD 1.7:

$ ./CADD.sh -a -g GRCh38 -o ~/tmp/cadd/output_inclAnno_GRCh37.tsv.gz ./test/input.vcf
CADD-v1.7 (c) University of Washington, Hudson-Alpha Institute for Biotechnology and Berlin Institute of Health at Charité - Universitätsmedizin Berlin 2013-2023. All rights reserved.
Running snakemake pipeline:
snakemake /tmp/tmp.Zmoud0EEpB/input.tsv.gz --use-conda --conda-prefix /data/analysis/src/CADD-scripts-1.7/envs/conda --cores 1
--configfile /data/analysis/src/CADD-scripts-1.7/config/config_GRCh38_v1.7.yml --snakefile /data/analysis/src/CADD-scripts-1.7/Snakefile -q
host: vlp-dmpianal06.dhe.duke.edu
WorkflowError in rule join in file /data/analysis/src/CADD-scripts-1.7/Snakefile, line 303:
Failed to open input file: /tmp/tmp.Zmoud0EEpB/input.anno.novel.vcf. Has it been deleted by another process? (rule join, line 612, /data/analysis/src/CADD-scripts-1.7/Snakefile)

Verbose output using -p is attached. cadd-output.txt

I'm running Red Hat EL9 and Miniforge3 conda with snakemake 8.20.3

Thank you in advance for suggestions.

visze commented 3 days ago

Which CADD-scripts version are you using? If it is 1.7, please use only Snakemake 7.x. For CADD-scripts 1.7.1, your Snakemake version should be fine.

(From your command I am pretty sure you are using 1.7 and not 1.7.1, so please switch to the latest CADD-scripts version. A simple upgrade of the repo should be enough; no other data is needed.)
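The version pairing mentioned in this thread (CADD-scripts v1.7 with Snakemake 7.x, v1.7.1 with Snakemake 8.x) can be checked before launching the pipeline. A minimal sketch; the helper names `required_snakemake_major` and `check_snakemake` are made up for illustration and are not part of CADD-scripts:

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of CADD-scripts.

# Map a CADD-scripts version to the Snakemake major version named in this thread.
required_snakemake_major() {
    case "$1" in
        1.7)   echo 7 ;;        # CADD-scripts v1.7: Snakemake 7.x
        1.7.1) echo 8 ;;        # CADD-scripts v1.7.1: Snakemake 8.x
        *)     echo unknown ;;
    esac
}

# Compare an installed Snakemake version string (e.g. "8.20.3")
# against the major version expected for a CADD-scripts version.
check_snakemake() {
    local cadd_version="$1" snakemake_version="$2"
    local want have
    want="$(required_snakemake_major "$cadd_version")"
    have="${snakemake_version%%.*}"   # keep only the major component
    if [ "$want" = "$have" ]; then
        echo "ok"
    else
        echo "mismatch: CADD-scripts $cadd_version expects Snakemake $want.x, found $snakemake_version"
    fi
}

# Example with the versions reported in this issue:
check_snakemake 1.7 8.20.3
```

In practice the second argument could come from `snakemake --version` instead of being typed in by hand.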

Do you run Snakemake locally or in a cluster environment? Some environments have difficulty running it in /tmp.

Alternatively, you could skip the CADD.sh script and call Snakemake directly to avoid the /tmp directory.

Then you have to modify this command:

snakemake /tmp/tmp.Zmoud0EEpB/input.tsv.gz --use-conda --conda-prefix /data/analysis/src/CADD-scripts-1.7/envs/conda --cores 1 --configfile /data/analysis/src/CADD-scripts-1.7/config/config_GRCh38_v1.7.yml --snakefile /data/analysis/src/CADD-scripts-1.7/Snakefile -q
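One way to modify the command is to point the target at a working directory outside `/tmp`. The sketch below only assembles and prints such a command. The flags are the ones from the command above, while `CADD_DIR`, `WORKDIR`, and the helper `print_snakemake_cmd` are placeholders. It assumes the input VCF has already been copied into that directory as `input.vcf`, which is what the log later in this thread suggests CADD.sh does.

```shell
#!/usr/bin/env bash
# Sketch only: prints the command instead of running it.
# CADD_DIR and WORKDIR are placeholders; adjust to the local installation.

print_snakemake_cmd() {
    local cadd_dir="$1" workdir="$2"
    # printf repeats the '%s ' format for every argument that follows.
    printf '%s ' \
        snakemake "$workdir/input.tsv.gz" \
        --use-conda \
        --conda-prefix "$cadd_dir/envs/conda" \
        --cores 1 \
        --configfile "$cadd_dir/config/config_GRCh38_v1.7.yml" \
        --snakefile "$cadd_dir/Snakefile" \
        -q
    printf '\n'
}

CADD_DIR=/data/analysis/src/CADD-scripts-1.7
WORKDIR="$HOME/caddtmp/run1"
print_snakemake_cmd "$CADD_DIR" "$WORKDIR"
```

Printing first makes it easy to review the paths before anything is executed.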

bchesnut commented 3 days ago

@visze Thank you for the comments. I am running CADD 1.7 and following the README.md directions per https://github.com/kircherlab/CADD-scripts, which specify using Snakemake version 8.

I am running the CADD.sh script. I tried setting TMPDIR=~/caddtmp to avoid using /tmp, but I get a similar missing-file error.
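For reference, a quick way to confirm that `mktemp` honours `TMPDIR` on a given system (the later log in this thread shows the `tmp.XXXX` working directory being created under the exported `TMPDIR`). A small sketch, using the directory name mentioned above:

```shell
#!/usr/bin/env bash
# Point temporary files away from /tmp and confirm that mktemp follows.
export TMPDIR="$HOME/caddtmp"
mkdir -p "$TMPDIR"

scratch="$(mktemp -d)"        # created under $TMPDIR when it is set
case "$scratch" in
    "$TMPDIR"/*) echo "scratch directory is under TMPDIR: $scratch" ;;
    *)           echo "mktemp ignored TMPDIR: $scratch" ;;
esac
rmdir "$scratch"
```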

I tried CADD 1.7.1 with different/worse results. README.md mentions using apptainer/singularity, but no specifics.

visze commented 3 days ago

I am very sure you are using CADD-scripts v1.7.

For CADD-scripts v1.7.1 your command should look like: https://github.com/kircherlab/CADD-scripts/blob/77df69ac1e23704795d767b0c63d8955924b9838/CADD.sh#L148-L151

But it looks like v1.7: https://github.com/kircherlab/CADD-scripts/blob/203ee3bf3cc6313ebd837a750f1bb21c4c64b326/CADD.sh#L126-L127

CADD-scripts v1.7 requires Snakemake 7.x, which is mentioned in its README: https://github.com/kircherlab/CADD-scripts/blob/203ee3bf3cc6313ebd837a750f1bb21c4c64b326/README.md?plain=1#L56-L59

You are referring to the latest README (CADD-scripts v1.7.1 release), which is correct: there Snakemake 8.x should be used. Apptainer only works with CADD-scripts v1.7.1, where it is the default in CADD.sh. If you want to disable it, use the -m option.

Can you please show me the results behind "I tried CADD 1.7.1 with different/worse results"?

bchesnut commented 3 days ago

Installed CADD 1.7.1 in /data/workspace/bchesnut/CADD-scripts-1.7.1

Linked /data/workspace/bchesnut/CADD-scripts-1.7.1/data to data location:

$ cd /data/workspace/bchesnut/CADD-scripts-1.7.1
$ rm -rf data
$ ln -s /dmpi/analysis/analysis_data/CADD data

Set some environment variables:

$ export TMPDIR=/data/workspace/bchesnut/tmp
$ export APPTAINER_CACHEDIR=/data/workspace/bchesnut/apptainer

Ran ./install.sh

Ran ./CADD.sh -p -a -g GRCh38 -o ./output_inclAnno_GRCh38.tsv.gz ./test/input.vcf

$ ./CADD.sh -p -a -g GRCh38 -o ./output_inclAnno_GRCh38.tsv.gz ./test/input.vcf
CADD-v1.7 (c) University of Washington, Hudson-Alpha Institute for Biotechnology and Berlin Institute of Health at Charite - Universitatsmedizin Berlin 2013-2024. All rights reserved.
Running snakemake pipeline:
snakemake /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.tsv.gz --sdm conda apptainer --apptainer-prefix /data/workspace/bchesnut/CADD-scripts-1.7.1/envs/apptainer --singularity-args "--bind /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS " --conda-prefix /data/workspace/bchesnut/CADD-scripts-1.7.1/envs/conda --cores 1 --configfile /data/workspace/bchesnut/CADD-scripts-1.7.1/config/config_GRCh38_v1.7.yml --snakefile /data/workspace/bchesnut/CADD-scripts-1.7.1/Snakefile -p
Assuming unrestricted shared filesystem usage.
host: vlp-dmpianal06.dhe.duke.edu
Building DAG of jobs...
Pulling singularity image docker://visze/cadd-scripts-v1_7:0.1.0.
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job         count
--------  -------
join            1
prepare         1
prescore        1
total           3

Select jobs to execute...
Execute 1 jobs...

[Mon Sep 16 13:38:24 2024]
localrule prepare:
    input: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.vcf
    output: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf
    log: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepare.log
    jobid: 2
    reason: Missing output files: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf
    wildcards: file=/data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input
    resources: tmpdir=/data/workspace/bchesnut/tmp

        cat /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.vcf         | python /data/workspace/bchesnut/CADD-scripts-1.7.1/src/scripts/VCF2vepVCF.py         | sort -k1,1 -k2,2n -k4,4 -k5,5         | uniq > /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf 2> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepare.log

Activating singularity image /data/workspace/bchesnut/CADD-scripts-1.7.1/envs/apptainer/cbbe741652f49b1cd0ee6ebf25427cc2.simg
Activating conda environment: ../../../../conda-envs/a4fcaaffb623ea8aef412c66280bd623
[Mon Sep 16 13:38:28 2024]
Finished job 2.
1 of 3 steps (33%) done
Select jobs to execute...
Execute 1 jobs...

[Mon Sep 16 13:38:29 2024]
localcheckpoint prescore:
    input: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf, /data/workspace/bchesnut/CADD-scripts-1.7.1/data/prescored/GRCh38_v1.7/incl_anno
    output: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.novel.vcf, /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv
    log: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log
    jobid: 1
    reason: Missing output files: <TBD>; Input files updated by another job: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf
    wildcards: file=/data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input
    resources: tmpdir=/data/workspace/bchesnut/tmp
DAG of jobs will be updated after completion.

        # Prescoring
        echo '## Prescored variant file' > /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv 2> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log;
        PRESCORED_FILES=`find -L /data/workspace/bchesnut/CADD-scripts-1.7.1/data/prescored/GRCh38_v1.7/incl_anno -maxdepth 1 -type f -name \*.tsv.gz | wc -l`
        cp /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.new
        if [ ${PRESCORED_FILES} -gt 0 ];
        then
            for PRESCORED in $(ls /data/workspace/bchesnut/CADD-scripts-1.7.1/data/prescored/GRCh38_v1.7/incl_anno/*.tsv.gz)
            do
                cat /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.new                 | python /data/workspace/bchesnut/CADD-scripts-1.7.1/src/scripts/extract_scored.py --header                     -p $PRESCORED --found_out=/data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv.tmp                 > /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.tmp 2>> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log;
                cat /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv.tmp >> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv
                mv /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.tmp /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.new &> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log;
            done;
            rm /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv.tmp &>> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log
        fi
        mv /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.new /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.novel.vcf &>> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log

Activating singularity image /data/workspace/bchesnut/CADD-scripts-1.7.1/envs/apptainer/cbbe741652f49b1cd0ee6ebf25427cc2.simg
Activating conda environment: ../../../../conda-envs/a4fcaaffb623ea8aef412c66280bd623
find: ‘/data/workspace/bchesnut/CADD-scripts-1.7.1/data/prescored/GRCh38_v1.7/incl_anno’: No such file or directory
[Mon Sep 16 13:38:29 2024]
Error in rule prescore:
    jobid: 1
    input: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf, /data/workspace/bchesnut/CADD-scripts-1.7.1/data/prescored/GRCh38_v1.7/incl_anno
    output: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.novel.vcf, /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv
    log: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log (check log file(s) for error details)
    conda-env: /conda-envs/a4fcaaffb623ea8aef412c66280bd623
    shell:

        # Prescoring
        echo '## Prescored variant file' > /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv 2> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log;
        PRESCORED_FILES=`find -L /data/workspace/bchesnut/CADD-scripts-1.7.1/data/prescored/GRCh38_v1.7/incl_anno -maxdepth 1 -type f -name \*.tsv.gz | wc -l`
        cp /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.new
        if [ ${PRESCORED_FILES} -gt 0 ];
        then
            for PRESCORED in $(ls /data/workspace/bchesnut/CADD-scripts-1.7.1/data/prescored/GRCh38_v1.7/incl_anno/*.tsv.gz)
            do
                cat /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.new                 | python /data/workspace/bchesnut/CADD-scripts-1.7.1/src/scripts/extract_scored.py --header                     -p $PRESCORED --found_out=/data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv.tmp                 > /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.tmp 2>> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log;
                cat /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv.tmp >> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv
                mv /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.tmp /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.new &> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log;
            done;
            rm /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv.tmp &>> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log
        fi
        mv /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.new /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.novel.vcf &>> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job prescore since they might be corrupted:
/data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-09-16T132855.935867.snakemake.log
WorkflowError:
At least one job did not complete successfully.

visze commented 3 days ago

OK, I see two things. First, can you bgzip your input file to input.vcf.gz? I am not sure it will change anything, though.

Second, and this can also cause trouble: paths have to be set correctly for Apptainer images (the apptainer --bind option). You have to bind quite a few of them explicitly, e.g. the tmp folder. Otherwise the tmp directory inside the Singularity image is used, which is then wiped, so the files are gone the next time the image is loaded.
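One thing that may be worth checking here: in the log above, `data` is a symlink to `/dmpi/analysis/analysis_data/CADD`, and a symlink whose target is not bound will not resolve inside the container. The sketch below builds a comma-separated bind list from the resolved paths. `APPTAINER_BINDPATH` is Apptainer's environment variable for extra bind paths; the helper `bind_list` is hypothetical, and whether this resolves the error on this particular system is untested.

```shell
#!/usr/bin/env bash
# Hypothetical helper: resolve each path (following symlinks) and join
# the results with commas, the format Apptainer accepts for bind paths.
bind_list() {
    local out="" p real
    for p in "$@"; do
        real="$(readlink -f "$p")" || continue   # skip paths that cannot be resolved
        out="${out:+$out,}$real"
    done
    echo "$out"
}

# Paths taken from this thread; adjust to the local setup.
CADD_DIR=/data/workspace/bchesnut/CADD-scripts-1.7.1
export APPTAINER_BINDPATH="$(bind_list "$CADD_DIR/data" "${TMPDIR:-/tmp}")"
echo "APPTAINER_BINDPATH=$APPTAINER_BINDPATH"
```

The same resolved paths could instead be added to the `--bind` value that appears in `--singularity-args` in the logged Snakemake command.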

I tested it on my end and it worked, but you never know on other systems...

So maybe my first recommendation is to use only mamba for now, via the -m flag of the CADD.sh script?
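For the test case in this thread, that would mean adding `-m` to the earlier invocation. A sketch that only prints the command; the other options are copied from the command posted above, and `-m` is the option described earlier in this thread as disabling Apptainer:

```shell
#!/usr/bin/env bash
# Same test invocation as earlier in the thread, with -m added so that
# CADD.sh runs without Apptainer (per the maintainer's description).
cadd_cmd='./CADD.sh -m -p -a -g GRCh38 -o ./output_inclAnno_GRCh38.tsv.gz ./test/input.vcf'

# Print for review; run it from the CADD-scripts directory once it looks right.
echo "$cadd_cmd"
```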

kircherlab / CADD-scripts

WorkflowError - missing file #73