snakemake / snakemake-wrappers

This is the development home of the Snakemake wrapper repository, see
https://snakemake-wrappers.readthedocs.io
215 stars 185 forks source link

Problem with the VEP ANNOTATE snakemake wrapper #167

Open moldach opened 4 years ago

moldach commented 4 years ago

Snakemake version

Snakemake

5.23.0

Wrapper

Issue

I'm trying to adapt my regular VEP code to use the snakemake wrapper instead but am running into an issue.

I want to make sure that a) the wrapper works for me and b) it produces the same results as the following:

vep \
        -i {input.sample} \
        --species "caenorhabditis_elegans" \
        --format "vcf" \
        --everything \
        --offline \
        --force_overwrite \
        --fasta {input.ref} \
        --gff {input.annot} \
        --tab \
        --variant_class \
        --regulatory \
        --show_ref_allele \
        --numbers \
        --symbol \
        --protein \
        -o {params.sample}

In order to use VEP with wrappers there are 3 different rules.

I have got the following two working:

# VEP Download Plugins
rule download_vep_plugins:
    output:
        directory("resources/vep/plugins")
    params:
        release=100
    wrapper:
        "0.64.0/bio/vep/plugins"

# VEP Cache
rule get_vep_cache:
    output:
        directory("resources/vep/cache")
    params:
        species="caenorhabditis_elegans",
        build="WBcel235",
        release="100"
    log:
        "logs/vep/cache.log"
    wrapper:
        "0.64.0/bio/vep/cache"

The third rule is to actually run VEP for which I've written the following rule:

rule variant_annotation:
    input:
        calls= lambda wildcards: getVCFs(wildcards.sample),
        cache="resources/vep/cache",
        plugins="resources/vep/plugins",
    output:
        calls=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.annotated.vcf"),
        stats=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.html")
    params:
        plugins=["LoFtool"],
        extra="--everything"
    message: """--- Annotating Variants."""
    resources:
        mem = 30000,
        time = 120
    threads: 4
    wrapper:
        "0.64.0/bio/vep/annotate"

When submitting the job this is the error I receive:

Building DAG of jobs...
Using shell: /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       variant_annotation
        1

[Tue Aug 25 11:23:04 2020]
Job 0: --- Annotating Variants.

Activating conda environment: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
Failed to open VARIANT_CALLING/varscan/470_sorted_dedupped_snp_varscan.vcf: unknown file type
Possible precedence issue with control flow operator at /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
Traceback (most recent call last):
  File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpm4v6gdij.wrapper.py", line 44, in <module>
    "(bcftools view {snakemake.input.calls} | "
  File "/home/moldach/bin/snakemake/lib/python3.8/site-packages/snakemake/shell.py", line 156, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail;  (bcftools view VARIANT_CALLING/varscan/470_sorted_dedupped_snp_varscan.vcf | vep --everything --fork 4 --format vcf --vcf --cache --cache_version 100 --species caenorhabditis_elegans --assembly WBcel235 --dir_cache resources/vep/cache --dir_plugins resources/vep/plugins --offline --plugin LoFtool --output_file STDOUT --stats_file ANNOTATION/VEP/470.html | bcftools view -Ov > ANNOTATION/VEP/470.annotated.vcf)' returned non-zero exit status 1.
[Tue Aug 25 11:25:02 2020]
Error in rule variant_annotation:
    jobid: 0
    output: ANNOTATION/VEP/470.annotated.vcf, ANNOTATION/VEP/470.html
    conda-env: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f

RuleException:
CalledProcessError in line 393 of /scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile:
Command 'source /home/moldach/miniconda3/bin/activate '/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f'; set -euo pipefail;  python /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpm4v6gdij.wrapper.py' returned non-zero exit status 1.
  File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile", line 393, in __rule_variant_annotation
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.8.0/lib/python3.8/concurrent/futures/thread.py", line 57, in run
Removing output files of failed job variant_annotation since they might be corrupted:
ANNOTATION/VEP/470.annotated.vcf, ANNOTATION/VEP/470.html
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

This issue was originally brought up on StackOverflow, then on the ensembl-vep repo but seems it should be posted here for a definitive answer.

moldach commented 4 years ago

bcftools view is giving the warning from the output of samtools mpileup/varscan pileup2snp:

def getDeduppedBamsIndex(sample):
  return(list(os.path.join(aligns_dict[sample],"{0}.sorted.dedupped.bam.bai".format(sample,pair)) for pair in ['']))

rule mpilup:
    input:
    bam=lambda wildcards: getDeduppedBams(wildcards.sample),
        reference_genome=os.path.join(dirs_dict["REF_DIR"],config["REF_GENOME"])
    output:
    os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"),
    log:
        os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_{contig}_samtools_mpileup.log")
    params:
        extra=lambda wc: "-r {}".format(wc.contig)
    resources:
    mem = 1000,
        time = 30
    wrapper:
    "0.65.0/bio/samtools/mpileup"

rule mpileup_to_vcf:
    input:
    os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"),
    output:
    os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.vcf")
    message:
    "Calling SNP with Varscan2"
    threads:
    2 # Keep threading value to one for unzipped mpileup input
          # Set it to two for zipped mipileup files
    log:
        os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"varscan_{sample}_{contig}.log")
    resources:
    mem = 1000,
        time = 30
    wrapper:
    "0.65.0/bio/varscan/mpileup2snp"

rule vcf_merge:
    input:
    os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_I.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_II.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_III.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_IV.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_V.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_X.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_MtDNA.vcf")
    output:
    os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}.vcf")
    log: os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_vcf-merge.log")
    resources:
    mem = 1000,
        time = 10
    threads: 1
    message: """--- Merge VarScan by Chromosome."""
    shell: """
    awk 'FNR==1 && NR!=1 {{ while (/^<header>/) getline; }} 1 {{print}} ' {input} > {output}
        """

calling_dir = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"])
callings_locations = [calling_dir] * len_samples
callings_dict = dict(zip(sample_names, callings_locations))

def getVCFs(sample):
  return(list(os.path.join(callings_dict[sample],"{0}.vcf".format(sample,pair)) for pair in ['']))

rule annotate_variants:
    input:
    calls=lambda wildcards: getVCFs(wildcards.sample),
        cache="resources/vep/cache",
        plugins="resources/vep/plugins",
    output:
    calls="{sample}.annotated.vcf",
        stats="{sample}.html"
    params:
    # Pass a list of plugins to use, see https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html
        # Plugin args can be added as well, e.g. via an entry "MyPlugin,1,FOO", see docs.
        plugins=["LoFtool"],
        extra="--everything"  # optional: extra arguments
    log:
        "logs/vep/{sample}.log"
    threads: 4
    resources:
    time=30,
        mem=5000
    wrapper:
    "0.65.0/bio/vep/annotate"

If I run bcftools view on the output I get the error (same for any of the contigs as well so its not the rule merge_vcf) causing the issue:

$ bcftools view variant_calling/varscan/MTG324.vcf 
Failed to read from variant_calling/varscan/MTG324.vcf: unknown file type