Open AndreMonc opened 2 months ago
Hey Andre - that should work, yes. If you have intervals set as "true" in your config (recommended) then you should only need to add it to the "intervals" snakemake.
Off the top of my head, I don't think this will affect any downstream stuff. The annotated filters should all be fine for snps/indels, but you'll mostly need to go off script for filtering the reference sites in the all sites VCF. The QC data should be fine though, as that selects SNPs before running.
We're working on implementing a workaround for all site's VCFs based on the callable sites file (for windowed analyses), but it's still in development. So, this is useful motivation!
Erik
On Sun, Sep 22, 2024 at 6:27 PM Andre E. Moncrieff @.***> wrote:
Hi,
I'm just wanting to check if snpArcher can handle producing all-sites VCFs. From what I can tell, I just need to add the -all-sites flag to the GenotypeGVCFs command in the following two files (or maybe only necessary in the first file?):
- /snpArcher/workflow/rules/bam2vcf_gatk.smk
- /snpArcher/workflow/rules/bam2vcf_gatk_intervals.smk
Does that seem right?
Thanks! Andre
— Reply to this email directly, view it on GitHub https://github.com/harvardinformatics/snpArcher/issues/223, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADBA3N23X2FAZDEJ5BRKHT3ZX5U7RAVCNFSM6AAAAABOVCMSDSVHI2DSMVQWIX3LMV43ASLTON2WKOZSGU2DCMZZGEYDOMY . You are receiving this because you are subscribed to this thread.Message ID: @.***>
Thanks, Erik, for the lightning fast response. Super helpful.
That makes sense. Just to summarize, the current snpArcher pipeline has integrated quality filters for SNPs and indels but not for the invariant sites, so I'll need to make sure I include some additional filters for invariant sites on the back end.
For now, I'm thinking that taking the raw final VCF output from snpArcher, separating the invariant sites and variant sites (as recommended here: https://pixy.readthedocs.io/en/latest/guide/pixy_guide.html), doing some additional filtering in bcftools, and then rejoining the VCFs should work to complete the all sites VCF processing pipeline. Sound reasonable?
Great to hear y'all are developing an all-sites feature.
Thanks, Andre
Yep that's how I would approach it. If you want to fully take advantage of the snpArcher filtering, you can run the postprocessing module which will output a "clean_snps.vcf.gz." To do this, after the initial pipeline completes, you add a column to the sample sheet with the name "exclude" then put "Y" for any samples you want to remove (eg after inspecting the QC plots). Then restart the pipeline and the postprocessing module https://github.com/harvardinformatics/snpArcher/blob/main/workflow/modules/postprocess/Snakefilewill run and make the clean SNPs file. Note you can specify a maf filter in the config file.
Then all you need to do is manually make a reference sites only vcf from the "final" vcf, filter it, and merge it with the clean SNPs file for something pixy friendly.
hope this helps! writing this out - it wouldn't be that hard to implement as an option, will keep that in mind... Erik
On Mon, Sep 23, 2024 at 9:05 AM Andre E. Moncrieff @.***> wrote:
Thanks, Erik, for the lightning fast response. Super helpful.
That makes sense. Just to summarize, the current snpArcher pipeline has integrated quality filters for SNPs and indels but not for the invariant sites, so I'll need to make sure I include some additional filters for invariant sites on the back end.
For now, I'm thinking that taking the raw final VCF output from snpArcher, separating the invariant sites and variant sites (as recommended here: https://pixy.readthedocs.io/en/latest/guide/pixy_guide.html), doing some additional filtering in bcftools, and then rejoining the VCFs should work to complete the all sites VCF processing pipeline. Sound reasonable?
Great to hear y'all are developing an all-sites feature.
Thanks, Andre
— Reply to this email directly, view it on GitHub https://github.com/harvardinformatics/snpArcher/issues/223#issuecomment-2368736092, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADBA3N4BTLTISKDL63BBHNTZYA36HAVCNFSM6AAAAABOVCMSDSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNRYG4ZTMMBZGI . You are receiving this because you commented.Message ID: @.***>
A little confused. The "final" or "final raw" vcf refers to the vcf generated by the current snpArcher pipeline or by the changed pipeline with adding the -all-sites flag to the GenotypeGVCFs command? Thanks.
Ah, yeah we used to call it "final" vcf now it is just called "raw." Sorry about that. But yes, the "raw.vcf.gz" is the file that would have invariant sites in it if you added the -all-sites flag.
Erik
On Sun, Oct 20, 2024 at 8:25 AM daicy527 @.***> wrote:
A little confused. The "final" or "final raw" vcf refers to the vcf generated by the current snpArcher pipeline or by the changed pipeline with adding the -all-sites flag to the GenotypeGVCFs command? Thanks.
— Reply to this email directly, view it on GitHub https://github.com/harvardinformatics/snpArcher/issues/223#issuecomment-2425044447, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADBA3NYHQAVA33HKQKJZI4TZ4PDN7AVCNFSM6AAAAABOVCMSDSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMRVGA2DINBUG4 . You are receiving this because you commented.Message ID: @.***>
Thank you, Erik.
I have added "-all-sites" into 'bam2vcf_gatk_intervals.smk' and got errors like this:
Error in rule DB2vcf: jobid: 18434 input: results/GCA_013397395.1/genomics_db_import/DB_L0723.tar, results/GCA_013397395.1/data/genome/GCA_013397395.1.fna, results/GCA_013397395.1/data/genome/GCA_013397395.1.fna.fai, results/GCA_013397395.1/data/genome/GCA_013397395.1.dict output: results/GCA_013397395.1/vcfs/intervals/L0723.vcf.gz, results/GCA_013397395.1/vcfs/intervals/L0723.vcf.gz.tbi log: logs/GCA_013397395.1/gatk_genotypegvcfs/0723.txt (check log file(s) for error details) conda-env: /data/snparcher2/snpArcher/.snakemake/conda/9c53882be01c6342dfe89be44d40056f shell:
tar -xf results/GCA_013397395.1/genomics_db_import/DB_L0723.tar
gatk GenotypeGVCFs --java-options '-Xmx29000m -Xms29000m' -R results/GCA_013397395.1/data/genome/GCA_013397395.1.fna --heterozygosity 0.005 --genomicsdb-shared-posixfs-optimizations true -V gendb://results/GCA_013397395.1/genomics_db_import/DB_L0723 -all-sites -O results/GCA_013397395.1/vcfs/intervals/L0723.vcf.gz --tmp-dir /data/snpruntem/ZZXQAENGVYOU/ &> logs/GCA_013397395.1/gatk_genotype_gvcfs/0723.txt
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
=================== 0723.txt
Did I make mistakes? Thanks again.
This looks like a bug in GATK with the all sites
option, see here: https://github.com/broadinstitute/gatk/issues/8415.
There are a few suggested workarounds but it might take a bit of poking around to figure out how to implement them in snpArcher...
Thank you, tsackton.
Ran into this. Need to give the -L
option with the interval file. Here is an example rule for all sites vcf:
rule DB2vcf_allsites:
"""
This rule uses the genomic databases from the previous step (gvcf2DB) to create VCF files, one per list file. Thus, lists
are still scattered.
"""
input:
db = "results/{refGenome}/genomics_db_import/DB_L{l}.tar",
ref = "results/{refGenome}/data/genome/{refGenome}.fna",
fai = "results/{refGenome}/data/genome/{refGenome}.fna.fai",
dictf = "results/{refGenome}/data/genome/{refGenome}.dict",
interval_file = "results/{refGenome}/intervals/db_intervals/{l}-scattered.interval_list"
output:
vcf = temp("results/{refGenome}/vcfs/intervals/L{l}.allsites.vcf.gz"),
vcfidx = temp("results/{refGenome}/vcfs/intervals/L{l}.allsites.vcf.gz.tbi"),
params:
het = config['het_prior'],
db = lambda wc, input: input.db[:-4]
resources:
tmpdir = get_big_temp
log:
"logs/{refGenome}/gatk_genotype_gvcfs_allsites/{l}.txt"
benchmark:
"benchmarks/{refGenome}/gatk_genotype_gvcfs_allsites/{l}.txt"
conda:
"../envs/bam2vcf.yml"
shell:
"""
tar -xf {input.db}
gatk GenotypeGVCFs \
--java-options '-Xmx{resources.mem_mb_reduced}m -Xms{resources.mem_mb_reduced}m' \
-R {input.ref} \
--heterozygosity {params.het} \
--genomicsdb-shared-posixfs-optimizations true \
--include-non-variant-sites \
-L {input.interval_file} \
-V gendb://{params.db} \
-O {output.vcf} \
--tmp-dir {resources.tmpdir} &> {log}
"""
Hi,
I'm just wanting to check in regarding how to obtain an all-sites VCF using snpArcher. From what I can tell, I just need to add the -all-sites flag to the GenotypeGVCFs command in the following file if using an interval approach to variant calling:
/snpArcher/workflow/rules/bam2vcf_gatk_intervals.smk
And do the same for the following file, if not using an interval approach, correct?
/snpArcher/workflow/rules/bam2vcf_gatk.smk
Does that seem right?
Thanks! Andre