kircherlab / CADD-scripts

CADD scripts release for offline scoring. For more information about CADD, please visit our website
http://cadd.gs.washington.edu

Snakemake 8.25.3 + CADD 1.7.2 running into snakemake execution flow order error #78

Closed yangyxt closed 23 hours ago

yangyxt commented 5 days ago

The rule annotate_vep does not run before annotate_esm.

After the checkpoint prescore, the snakemake pipeline directly runs annotate_esm.

It seems that, inside the Snakefile, the input of annotate_vep might need a more dynamic reference such as checkpoints.prescore.get(file=wildcards.file).output.novel, so that Snakemake recognizes that this rule depends on the results of the checkpoint prescore.

Please take a look. Thanks!

fish2022Jul commented 5 days ago

ah, I got the same error as you.

yangyxt commented 5 days ago

I just tried v1.7.1 with Snakemake 8.25.2 and got the same error.

I'm using conda alone for the Snakemake pipeline. Here is the full log:

`Assuming unrestricted shared filesystem usage.
host: paedyl01
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however important for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Creating conda environment /paedyl01/disk1/yangyxt/Tools/CADD/CADD-scripts-1.7.1/envs/environment_minimal.yml...
Downloading and installing remote packages.
Environment for /paedyl01/disk1/yangyxt/Tools/CADD/CADD-scripts-1.7.1/envs/environment_minimal.yml created (location: ../../Tools/CADD/CADD-scripts-1.7.1/envs/conda/a2b5c57805b7ab088ae6802ddde5c6cf_)
Using shell: /usr/bin/bash
Provided cores: 5
Rules claiming more threads will be scaled down.
Singularity containers: ignored
Job stats:
job          count
decompress       1
join             1
prepare          1
prescore         1
total            4

Select jobs to execute...
Execute 1 jobs...

[Thu Nov 21 16:11:53 2024]
localrule decompress:
    input: /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.vcf.gz
    output: /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.vcf
    log: /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.decompress.log
    jobid: 3
    reason: Missing output files: /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.vcf
    wildcards: file=/paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr
    resources: tmpdir=/paedyl01/disk1/yangyxt/test_tmp

    zcat /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.vcf.gz > /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.vcf 2> /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.decompress.log

Activating conda environment: ../../Tools/CADD/CADD-scripts-1.7.1/envs/conda/a2b5c57805b7ab088ae6802ddde5c6cf_
[Thu Nov 21 16:11:53 2024]
Finished job 3.
1 of 4 steps (25%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Nov 21 16:11:53 2024]
localrule prepare:
    input: /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.vcf
    output: /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prepared.vcf
    log: /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prepare.log
    jobid: 2
    reason: Missing output files: /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prepared.vcf; Input files updated by another job: /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.vcf
    wildcards: file=/paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr
    resources: tmpdir=/paedyl01/disk1/yangyxt/test_tmp

    cat /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.vcf         | python /paedyl01/disk1/yangyxt/Tools/CADD/CADD-scripts-1.7.1/src/scripts/VCF2vepVCF.py         | sort -k1,1 -k2,2n -k4,4 -k5,5         | uniq > /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prepared.vcf 2> /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prepare.log

Activating conda environment: ../../Tools/CADD/CADD-scripts-1.7.1/envs/conda/a2b5c57805b7ab088ae6802ddde5c6cf_
[Thu Nov 21 16:11:54 2024]
Finished job 2.
2 of 4 steps (50%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Nov 21 16:11:54 2024]
localcheckpoint prescore:
    input: /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prepared.vcf, /paedyl01/disk1/yangyxt/Tools/CADD/CADD-scripts-1.7.1/data/prescored/GRCh37_v1.7/incl_anno
    output: /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.novel.vcf, /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.pre.tsv
    log: /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prescore.log
    jobid: 1
    reason: Missing output files: ; Input files updated by another job: /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prepared.vcf
    wildcards: file=/paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr
    resources: tmpdir=/paedyl01/disk1/yangyxt/test_tmp
DAG of jobs will be updated after completion.

    # Prescoring
    echo '## Prescored variant file' > /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.pre.tsv 2> /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prescore.log;
    PRESCORED_FILES=`find -L /paedyl01/disk1/yangyxt/Tools/CADD/CADD-scripts-1.7.1/data/prescored/GRCh37_v1.7/incl_anno -maxdepth 1 -type f -name \*.tsv.gz | wc -l`
    cp /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prepared.vcf /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prepared.vcf.new
    if [ ${PRESCORED_FILES} -gt 0 ];
    then
        for PRESCORED in $(ls /paedyl01/disk1/yangyxt/Tools/CADD/CADD-scripts-1.7.1/data/prescored/GRCh37_v1.7/incl_anno/*.tsv.gz)
        do
            cat /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prepared.vcf.new                 | python /paedyl01/disk1/yangyxt/Tools/CADD/CADD-scripts-1.7.1/src/scripts/extract_scored.py --header                     -p $PRESCORED --found_out=/paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.pre.tsv.tmp                 > /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prepared.vcf.tmp 2>> /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prescore.log;
            cat /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.pre.tsv.tmp >> /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.pre.tsv
            mv /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prepared.vcf.tmp /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prepared.vcf.new &> /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prescore.log;
        done;
        rm /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.pre.tsv.tmp &>> /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prescore.log
    fi
    mv /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prepared.vcf.new /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.novel.vcf &>> /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.prescore.log

Activating conda environment: ../../Tools/CADD/CADD-scripts-1.7.1/envs/conda/a2b5c57805b7ab088ae6802ddde5c6cf_
[Thu Nov 21 16:19:41 2024]
Finished job 1.
3 of 4 steps (75%) done
MissingInputException in rule annotate_esm in file /paedyl01/disk1/yangyxt/Tools/CADD/CADD-scripts-1.7.1/Snakefile, line 131:
Missing input files for rule annotate_esm:
    output: /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.esm_missens.vcf.gz, /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.esm_frameshift.vcf.gz, /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.esm.vcf.gz
    wildcards: file=/paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr
    affected files:
        data/annotations/GRCh37_v1.7/esm/esm1v_t33_650M_UR90S_1.pt
        data/annotations/GRCh37_v1.7/esm/esm1v_t33_650M_UR90S_4.pt
        data/annotations/GRCh37_v1.7/esm/esm1v_t33_650M_UR90S_2.pt
        data/annotations/GRCh37_v1.7/esm/pep.110.fa
        data/annotations/GRCh37_v1.7/esm/esm1v_t33_650M_UR90S_5.pt
        data/annotations/GRCh37_v1.7/esm/esm1v_t33_650M_UR90S_3.pt

ERROR conda.cli.main_run:execute(125): conda run /paedyl01/disk1/yangyxt/Tools/CADD/CADD-scripts-1.7.1/CADD.sh -c 5 -a -p -m -d -g GRCh37 -o /paedyl01/disk1/yangyxt/test_acmg_auto/TEST_FAM.filtered.anno.cadd.tsv.gz /paedyl01/disk1/yangyxt/test_acmg_auto/TEST_FAM.filtered.anno.nochr.vcf.gz failed. (See above for error)
CADD-v1.7 (c) University of Washington, Hudson-Alpha Institute for Biotechnology and Berlin Institute of Health at Charite - Universitatsmedizin Berlin 2013-2024. All rights reserved.
Running snakemake pipeline:
snakemake /paedyl01/disk1/yangyxt/test_tmp/tmp.9svh5h3ptS/TEST_FAM.filtered.anno.nochr.tsv.gz --sdm conda --conda-prefix /paedyl01/disk1/yangyxt/Tools/CADD/CADD-scripts-1.7.1/envs/conda --cores 5 --configfile /paedyl01/disk1/yangyxt/Tools/CADD/CADD-scripts-1.7.1/config/config_GRCh37_v1.7.yml --snakefile /paedyl01/disk1/yangyxt/Tools/CADD/CADD-scripts-1.7.1/Snakefile -p`

yangyxt commented 4 days ago

> ah, I got the same error as you.

Have you found a solution? It seems annotate_vep can't be the first rule to run after the checkpoint prescore, no matter how I change the input VCF in the annotate_vep rule. So frustrating.

yangyxt commented 4 days ago

I figured out the issue. The execution order of the DAG rebuilt after the checkpoint is not wrong. The actual problem is that the annotation resources required by annotate_esm are reported as missing.

In the Snakefile, the paths to the ESM model files are not absolute. When specifying these input files, we should prepend os.environ["CADD"] so that they point to the parent directory that stores all the annotation resources.
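A minimal sketch of that idea in plain Python, assuming the CADD environment variable points at the CADD-scripts installation directory. The helper names `cadd_path` and `missing_resources` are hypothetical and not part of the CADD Snakefile; the example path is one of the "affected files" from the log.

```python
import os


def cadd_path(relative_path, base=None):
    """Anchor a relative resource path at the CADD installation directory."""
    # Assumption: the CADD environment variable holds the installation root.
    base = os.environ["CADD"] if base is None else base
    return os.path.join(base, relative_path)


def missing_resources(relative_paths, base=None):
    """Return the resolved paths that do not exist on disk."""
    resolved = [cadd_path(p, base) for p in relative_paths]
    return [p for p in resolved if not os.path.exists(p)]


if __name__ == "__main__":
    # Relative path as reported under "affected files" in the log.
    example = "data/annotations/GRCh37_v1.7/esm/pep.110.fa"
    if "CADD" in os.environ:
        print(cadd_path(example))
        print(missing_resources([example]))
```

Running a check like this from a different working directory shows why the relative paths fail: they only resolve when the current directory happens to be the CADD-scripts folder.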

Also, annotate_mmsplice should be made a conditional rule, defined under an if condition that checks whether the GenomeBuild is GRCh38.
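The decision logic could look roughly like the following plain-Python sketch. The function `annotation_rules` and the shape of `config` are assumptions for illustration only; in the Snakefile itself the if condition would wrap the rule definition directly.

```python
def annotation_rules(config):
    """List the annotation rules to define for the configured genome build."""
    # Rule names are taken from this thread; other pipeline rules seen in the
    # log (decompress, prepare, prescore, join) are omitted here.
    rules = ["annotate_vep", "annotate_esm"]
    # Assumption: config["GenomeBuild"] is either "GRCh37" or "GRCh38".
    if config.get("GenomeBuild") == "GRCh38":
        rules.append("annotate_mmsplice")
    return rules
```

With a GRCh37 config, annotate_mmsplice is never defined, so Snakemake would not try to resolve its inputs.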

I'll address these findings in a PR (https://github.com/kircherlab/CADD-scripts/pull/80#issue-2681922897) later.

visze commented 4 days ago

Thanks! Good catch! I will look at the PR shortly. Since CADD-scripts v1.7.2 is very new, I will retag it rather than create a new version.

visze commented 3 days ago

can you retry with the latest master branch?

fish2022Jul commented 3 days ago

> can you retry with the latest master branch?

I installed it from v1.7.2.tar.gz. How do I update to the latest version on GitHub?

fish2022Jul commented 3 days ago

> can you retry with the latest master branch?

Yes, it WORKS with the latest Snakemake!!!! Thank you!!!!