harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License
74 stars 33 forks source link

all-sites VCF question #223

Open AndreMonc opened 2 months ago

AndreMonc commented 2 months ago

Hi,

I'm just wanting to check in regarding how to obtain an all-sites VCF using snpArcher. From what I can tell, I just need to add the -all-sites flag to the GenotypeGVCFs command in the following file if using an interval approach to variant calling:

/snpArcher/workflow/rules/bam2vcf_gatk_intervals.smk

And do the same for the following file, if not using an interval approach, correct?

/snpArcher/workflow/rules/bam2vcf_gatk.smk

Does that seem right?

Thanks! Andre

erikenbody commented 2 months ago

Hey Andre - that should work, yes. If you have intervals set as "true" in your config (recommended) then you should only need to add it to the "intervals" snakemake.

Off the top of my head, I don't think this will affect any downstream stuff. The annotated filters should all be fine for snps/indels, but you'll mostly need to go off script for filtering the reference sites in the all sites VCF. The QC data should be fine though, as that selects SNPs before running.

We're working on implementing a workaround for all site's VCFs based on the callable sites file (for windowed analyses), but it's still in development. So, this is useful motivation!

Erik

On Sun, Sep 22, 2024 at 6:27 PM Andre E. Moncrieff @.***> wrote:

Hi,

I'm just wanting to check if snpArcher can handle producing all-sites VCFs. From what I can tell, I just need to add the -all-sites flag to the GenotypeGVCFs command in the following two files (or maybe only necessary in the first file?):

  1. /snpArcher/workflow/rules/bam2vcf_gatk.smk
  2. /snpArcher/workflow/rules/bam2vcf_gatk_intervals.smk

Does that seem right?

Thanks! Andre

— Reply to this email directly, view it on GitHub https://github.com/harvardinformatics/snpArcher/issues/223, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADBA3N23X2FAZDEJ5BRKHT3ZX5U7RAVCNFSM6AAAAABOVCMSDSVHI2DSMVQWIX3LMV43ASLTON2WKOZSGU2DCMZZGEYDOMY . You are receiving this because you are subscribed to this thread.Message ID: @.***>

AndreMonc commented 2 months ago

Thanks, Erik, for the lightning fast response. Super helpful.

That makes sense. Just to summarize, the current snpArcher pipeline has integrated quality filters for SNPs and indels but not for the invariant sites, so I'll need to make sure I include some additional filters for invariant sites on the back end.

For now, I'm thinking that taking the raw final VCF output from snpArcher, separating the invariant sites and variant sites (as recommended here: https://pixy.readthedocs.io/en/latest/guide/pixy_guide.html), doing some additional filtering in bcftools, and then rejoining the VCFs should work to complete the all sites VCF processing pipeline. Sound reasonable?

Great to hear y'all are developing an all-sites feature.

Thanks, Andre

erikenbody commented 2 months ago

Yep that's how I would approach it. If you want to fully take advantage of the snpArcher filtering, you can run the postprocessing module which will output a "clean_snps.vcf.gz." To do this, after the initial pipeline completes, you add a column to the sample sheet with the name "exclude" then put "Y" for any samples you want to remove (eg after inspecting the QC plots). Then restart the pipeline and the postprocessing module https://github.com/harvardinformatics/snpArcher/blob/main/workflow/modules/postprocess/Snakefilewill run and make the clean SNPs file. Note you can specify a maf filter in the config file.

Then all you need to do is manually make a reference sites only vcf from the "final" vcf, filter it, and merge it with the clean SNPs file for something pixy friendly.

hope this helps! writing this out - it wouldn't be that hard to implement as an option, will keep that in mind... Erik

On Mon, Sep 23, 2024 at 9:05 AM Andre E. Moncrieff @.***> wrote:

Thanks, Erik, for the lightning fast response. Super helpful.

That makes sense. Just to summarize, the current snpArcher pipeline has integrated quality filters for SNPs and indels but not for the invariant sites, so I'll need to make sure I include some additional filters for invariant sites on the back end.

For now, I'm thinking that taking the raw final VCF output from snpArcher, separating the invariant sites and variant sites (as recommended here: https://pixy.readthedocs.io/en/latest/guide/pixy_guide.html), doing some additional filtering in bcftools, and then rejoining the VCFs should work to complete the all sites VCF processing pipeline. Sound reasonable?

Great to hear y'all are developing an all-sites feature.

Thanks, Andre

— Reply to this email directly, view it on GitHub https://github.com/harvardinformatics/snpArcher/issues/223#issuecomment-2368736092, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADBA3N4BTLTISKDL63BBHNTZYA36HAVCNFSM6AAAAABOVCMSDSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNRYG4ZTMMBZGI . You are receiving this because you commented.Message ID: @.***>

daicy527 commented 1 month ago

A little confused. The "final" or "final raw" vcf refers to the vcf generated by the current snpArcher pipeline or by the changed pipeline with adding the -all-sites flag to the GenotypeGVCFs command? Thanks.

erikenbody commented 1 month ago

Ah, yeah we used to call it "final" vcf now it is just called "raw." Sorry about that. But yes, the "raw.vcf.gz" is the file that would have invariant sites in it if you added the -all-sites flag.

Erik

On Sun, Oct 20, 2024 at 8:25 AM daicy527 @.***> wrote:

A little confused. The "final" or "final raw" vcf refers to the vcf generated by the current snpArcher pipeline or by the changed pipeline with adding the -all-sites flag to the GenotypeGVCFs command? Thanks.

— Reply to this email directly, view it on GitHub https://github.com/harvardinformatics/snpArcher/issues/223#issuecomment-2425044447, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADBA3NYHQAVA33HKQKJZI4TZ4PDN7AVCNFSM6AAAAABOVCMSDSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMRVGA2DINBUG4 . You are receiving this because you commented.Message ID: @.***>

daicy527 commented 1 month ago

Thank you, Erik.

I have added "-all-sites" into 'bam2vcf_gatk_intervals.smk' and got errors like this:

Error in rule DB2vcf: jobid: 18434 input: results/GCA_013397395.1/genomics_db_import/DB_L0723.tar, results/GCA_013397395.1/data/genome/GCA_013397395.1.fna, results/GCA_013397395.1/data/genome/GCA_013397395.1.fna.fai, results/GCA_013397395.1/data/genome/GCA_013397395.1.dict output: results/GCA_013397395.1/vcfs/intervals/L0723.vcf.gz, results/GCA_013397395.1/vcfs/intervals/L0723.vcf.gz.tbi log: logs/GCA_013397395.1/gatk_genotypegvcfs/0723.txt (check log file(s) for error details) conda-env: /data/snparcher2/snpArcher/.snakemake/conda/9c53882be01c6342dfe89be44d40056f shell:

    tar -xf results/GCA_013397395.1/genomics_db_import/DB_L0723.tar
    gatk GenotypeGVCFs             --java-options '-Xmx29000m -Xms29000m'             -R results/GCA_013397395.1/data/genome/GCA_013397395.1.fna             --heterozygosity 0.005             --genomicsdb-shared-posixfs-optimizations true             -V gendb://results/GCA_013397395.1/genomics_db_import/DB_L0723             -all-sites             -O results/GCA_013397395.1/vcfs/intervals/L0723.vcf.gz             --tmp-dir /data/snpruntem/ZZXQAENGVYOU/ &> logs/GCA_013397395.1/gatk_genotype_gvcfs/0723.txt

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

=================== 0723.txt

Did I make mistakes? Thanks again.

tsackton commented 1 month ago

This looks like a bug in GATK with the all sites option, see here: https://github.com/broadinstitute/gatk/issues/8415.

tsackton commented 1 month ago

There are a few suggested workarounds but it might take a bit of poking around to figure out how to implement them in snpArcher...

daicy527 commented 1 month ago

Thank you, tsackton.

cademirch commented 1 week ago

Ran into this. Need to give the -L option with the interval file. Here is an example rule for all sites vcf:

rule DB2vcf_allsites:
    """
    This rule uses the genomic databases from the previous step (gvcf2DB) to create VCF files, one per list file. Thus, lists
    are still scattered.
    """
    input:
        db = "results/{refGenome}/genomics_db_import/DB_L{l}.tar",
        ref = "results/{refGenome}/data/genome/{refGenome}.fna",
        fai = "results/{refGenome}/data/genome/{refGenome}.fna.fai",
        dictf = "results/{refGenome}/data/genome/{refGenome}.dict",
        interval_file = "results/{refGenome}/intervals/db_intervals/{l}-scattered.interval_list"
    output:
        vcf = temp("results/{refGenome}/vcfs/intervals/L{l}.allsites.vcf.gz"),
        vcfidx = temp("results/{refGenome}/vcfs/intervals/L{l}.allsites.vcf.gz.tbi"),
    params:
        het = config['het_prior'],
        db = lambda wc, input: input.db[:-4]
    resources:
        tmpdir = get_big_temp
    log:
        "logs/{refGenome}/gatk_genotype_gvcfs_allsites/{l}.txt"
    benchmark:
        "benchmarks/{refGenome}/gatk_genotype_gvcfs_allsites/{l}.txt"
    conda:
        "../envs/bam2vcf.yml"
    shell:
        """
        tar -xf {input.db}
        gatk GenotypeGVCFs \
            --java-options '-Xmx{resources.mem_mb_reduced}m -Xms{resources.mem_mb_reduced}m' \
            -R {input.ref} \
            --heterozygosity {params.het} \
            --genomicsdb-shared-posixfs-optimizations true \
            --include-non-variant-sites \
            -L {input.interval_file} \
            -V gendb://{params.db} \
            -O {output.vcf} \
            --tmp-dir {resources.tmpdir} &> {log}
        """