uab-cgds-worthey / DITTO

Variant Deleteriousness prediction tool using AI
GNU General Public License v3.0

Add normalize_vcf and remove_homref_sites to nextflow pipeline #18

Closed by sdhutchins 1 year ago

sdhutchins commented 1 year ago

The rules below come from a prior Snakemake workflow. We want to use them as the first two steps in the Nextflow pipeline.

Below is the conda config that was used:

channels:
  - conda-forge
  - bioconda
dependencies:
  - bcftools =1.12

Here are the two Snakemake rules:

rule normalize_vcf:
    input:
        vcf = INTERIM_DIR / "single_sample_vcf" / "{train_test}" / "split" / "{sample}.vcf.gz",
        ref=REF_FASTA,
    output:
        INTERIM_DIR / "single_sample_vcf" / "{train_test}" / "normalized" / "{sample}.vcf.gz"
    message:
        "Normalizing sample: {wildcards.sample} ({wildcards.train_test})"
    conda:
        str(WORKFLOW_PATH / "configs" / "envs" / "bcftools.yaml")
    threads: 2
    shell:
        r"""
        # first split multi-allelic sites and then normalize
        bcftools norm \
                -m-any \
                {input.vcf} \
            | bcftools norm \
                --threads {threads} \
                --check-ref we \
                --fasta-ref {input.ref} \
                -Oz -o {output}
        """
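
For porting, it can help to see the normalize_vcf shell block as two separate commands joined by a pipe. The following is a minimal Python sketch of that pipe, not the pipeline's actual implementation; the function names `build_norm_commands` and `run_normalize` are hypothetical, and the flags are copied from the rule above.

```python
import subprocess
from pathlib import Path


def build_norm_commands(vcf, ref, out, threads=2):
    """Return the two argv lists that mirror the normalize_vcf shell block."""
    # Stage 1: split multi-allelic records into one record per ALT allele.
    split_cmd = ["bcftools", "norm", "-m-any", str(vcf)]
    # Stage 2: reads stage 1 from stdin (as in the shell pipe), checks REF
    # alleles against the FASTA, and writes bgzip-compressed VCF.
    norm_cmd = [
        "bcftools", "norm",
        "--threads", str(threads),
        "--check-ref", "we",
        "--fasta-ref", str(ref),
        "-Oz", "-o", str(out),
    ]
    return split_cmd, norm_cmd


def run_normalize(vcf, ref, out, threads=2):
    """Run both stages connected by a pipe; raise if either stage fails."""
    split_cmd, norm_cmd = build_norm_commands(vcf, ref, out, threads)
    Path(out).parent.mkdir(parents=True, exist_ok=True)
    split = subprocess.Popen(split_cmd, stdout=subprocess.PIPE)
    norm = subprocess.Popen(norm_cmd, stdin=split.stdout)
    # Close our copy of the pipe so stage 1 sees SIGPIPE if stage 2 exits early.
    split.stdout.close()
    norm_rc = norm.wait()
    split_rc = split.wait()
    if split_rc != 0 or norm_rc != 0:
        raise RuntimeError(
            f"bcftools norm failed (split={split_rc}, norm={norm_rc})"
        )
```

Calling `run_normalize("sample.vcf.gz", "ref.fa", "normalized/sample.vcf.gz")` (placeholder paths) would need bcftools on `PATH`. Keeping the argv construction separate from execution makes the flags easy to compare against whatever the Nextflow process ends up running.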

rule remove_homref_sites:
    input:
        INTERIM_DIR / "single_sample_vcf" / "{train_test}" / "normalized" / "{sample}.vcf.gz"
    output:
        INTERIM_DIR / "single_sample_vcf" / "{train_test}" / "homref_removed" / "{sample}.vcf.gz"
    message:
        "Remove homozygous ref sites. Sample: {wildcards.sample} ({wildcards.train_test})"
    conda:
        str(WORKFLOW_PATH / "configs" / "envs" / "bcftools.yaml")
    shell:
        r"""
        bcftools view \
            --include 'GT[*]="alt"' \
            -Oz -o "{output}" \
            "{input}"
        """
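
To make the intent of the `GT[*]="alt"` filter concrete, here is a small pure-Python sketch of the rule "keep a record if any sample carries a non-reference allele". It is an illustration only and does not call bcftools; the helper names `has_alt_allele` and `keep_site` are hypothetical. One assumption to flag: this sketch treats a partially missing genotype such as `./1` as carrying an ALT allele, and whether bcftools handles that edge case the same way has not been verified here.

```python
import re


def has_alt_allele(gt):
    """True if a GT string such as "0/1" or "1|2" carries a non-reference allele."""
    # Alleles are separated by "/" (unphased) or "|" (phased).
    alleles = re.split(r"[/|]", gt)
    # "0" is the reference allele and "." is missing; any positive index is an ALT.
    return any(a.isdigit() and int(a) > 0 for a in alleles)


def keep_site(genotypes):
    """Keep the record if at least one sample genotype has an ALT allele."""
    return any(has_alt_allele(gt) for gt in genotypes)
```

For example, `keep_site(["0/0", "0|1"])` keeps the record because the second sample is heterozygous, while `keep_site(["0/0", "./."])` drops it. In a single-sample VCF, as in this workflow, the check reduces to that one sample's genotype.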