harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

Question: Can I disable MarkDuplicates? #172

Closed. Erythroxylum closed this issue 2 months ago

Erythroxylum commented 2 months ago

Hello, I am interested in using this pipeline with GBS data, but the identical start and end sequences make most reads get flagged as duplicates. Is there a simple way to disable mark duplicates or would it require significant alterations to downstream code? Thanks!

cademirch commented 2 months ago

It might be easier to tell downstream tools (HaplotypeCaller) to ignore the duplicate flag in the BAMs. I'll look into this today.
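A minimal sketch of that idea, assuming GATK4, where HaplotypeCaller drops duplicate-flagged reads by default through NotDuplicateReadFilter and the --disable-read-filter argument turns a default filter off. The file names are placeholders, and this is not necessarily how snpArcher ended up handling it. The command is only printed here, not executed.

```shell
# Sketch only: ref.fa, sample.bam and sample.g.vcf.gz are placeholder names.
ref=ref.fa
bam=sample.bam
out=sample.g.vcf.gz

# Assumption: GATK4 HaplotypeCaller, which applies NotDuplicateReadFilter by default.
# Disabling that filter keeps reads that carry the duplicate flag.
hc_cmd="gatk HaplotypeCaller -R $ref -I $bam -O $out -ERC GVCF --disable-read-filter NotDuplicateReadFilter"

# Print the command for review instead of running it.
echo "$hc_cmd"
```

With this approach the BAMs keep their duplicate flags; only the caller stops acting on them.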

erikenbody commented 2 months ago

Do you know how this usually gets handled in GBS pipelines? If the only switch is to remove mark duplicates, it would be easy-ish to have it as an option. In the short term, you could also just replace this line (https://github.com/harvardinformatics/snpArcher/blob/730475ffdb1c0c4729b24e26fb4eb82a54ba3635/workflow/rules/fastq2bam.smk#L49) with something like cp {input.bam} {output.dedupBam} 2> {log}; then your BAM file wouldn't have any reads flagged as duplicates.

Erythroxylum commented 2 months ago

GATK is not a usual approach for GBS pipelines, but my understanding is that typically no deduplication is done because there is no good method to detect dups.

On Thu, Apr 11, 2024 at 1:10 PM Erik Enbody @.***> wrote:

Do you know how this usually gets handled in GBS pipelines? If the only switch is to remove mark duplicates, it would be easy-ish to have it as an option. In the short term, you could also just replace this line https://github.com/harvardinformatics/snpArcher/blob/730475ffdb1c0c4729b24e26fb4eb82a54ba3635/workflow/rules/fastq2bam.smk#L49 with something like cp {input.bam} {output.dedupBam} 2> {log} then your bam file wouldn't have any duplicates


cademirch commented 2 months ago

Working on adding this.

cademirch commented 2 months ago

As a temp solution, if you're not using GATK and just need the BAMS w/o duplicates marked, you can try:

snakemake --until dedup --no-temp <... other options>

cademirch commented 2 months ago

This will be added in #173, feel free to pull the branch to test it out

Erythroxylum commented 2 months ago

Hi @cademirch, wow, thanks for your prompt attention! I have cloned the branch. Can you please explain what flags I need to use to keep dups?

cademirch commented 2 months ago

@Erythroxylum You'll need to set mark_duplicates to False in the config.yaml. Also when you run snakemake, add the --notemp option.

tsackton commented 2 months ago

merged with #173

Erythroxylum commented 2 months ago

Hi Cade et al.: FYI, the mark_duplicates flag is absent from the snpArcher/config/config.yaml. I can, however, see it in the .test/ecoli/config/config.yaml. I just appended it to my main config/config.yaml, hoping that will work.
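The manual append described above can be sketched as follows. This is an illustration under assumptions: it works on a scratch file rather than a real checkout, the existing key is a made-up stand-in, and the `mark_duplicates: False` spelling follows the value given earlier in the thread.

```shell
# Sketch only: uses a scratch copy; point cfg at config/config.yaml in a real checkout.
cfg="$(mktemp -d)/config.yaml"

# Hypothetical stand-in for whatever settings the config already holds.
printf 'some_existing_key: "value"\n' > "$cfg"

# Append the key only if it is not already present, so reruns do not duplicate it.
if ! grep -q '^mark_duplicates:' "$cfg"; then
  printf 'mark_duplicates: False\n' >> "$cfg"
fi

cat "$cfg"

# Then run the workflow with temp() declarations ignored, as suggested above:
# snakemake --notemp <... other options>
```

Checking for the key before appending avoids ending up with two conflicting entries once the key is added to the default config upstream.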

cademirch commented 2 months ago

Oops. Thanks for pointing that out, I’ll fix soon.

On Mon, Apr 15, 2024 at 08:46 Dawson White @.***> wrote:

Hi Cade et al.: FYI, the mark_duplicates flag is absent from the snpArcher/config/config.yaml. I can, however, see it in the .test/ecoli/config/config.yaml. I just appended it to my main config/config.yaml, hoping that will work.
