stjudecloud / workflows

Bioinformatics workflows developed for and used on the St. Jude Cloud project.
MIT License

feat: GATK MarkDuplicatesSpark #153

Closed: adthrasher closed this 5 months ago

adthrasher commented 5 months ago

Adds a tool definition for GATK's MarkDuplicates that is Spark-enabled.

Two options, remove-all-duplicates and remove-sequencing-duplicates, are intentionally omitted from the tool definition. Neither can be specified (not even as false) when duplicate-tagging-policy is specified.
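For illustration only, the constraint above can be sketched as a small argument builder. This is not the tool definition added in this PR; the function name, its parameters, and the default policy value are hypothetical. The flag spellings are assumed to follow the GATK command line for MarkDuplicatesSpark and should be checked against the GATK version in use.

```python
from typing import List, Optional


def build_markduplicates_spark_args(
    bam: str,
    out_bam: str,
    tagging_policy: Optional[str] = "All",  # assumed default, for illustration
    remove_all_duplicates: bool = False,
    remove_sequencing_duplicates: bool = False,
) -> List[str]:
    """Build an argument list for `gatk MarkDuplicatesSpark` (hypothetical helper)."""
    args = ["gatk", "MarkDuplicatesSpark", "-I", bam, "-O", out_bam]

    if tagging_policy is not None:
        # The removal options may not appear at all alongside the tagging
        # policy, so they are rejected if requested and otherwise omitted.
        if remove_all_duplicates or remove_sequencing_duplicates:
            raise ValueError(
                "removal options cannot be combined with duplicate-tagging-policy"
            )
        args += ["--duplicate-tagging-policy", tagging_policy]
        return args

    # Without a tagging policy, only emit the removal flags that were enabled.
    if remove_all_duplicates:
        args.append("--remove-all-duplicates")
    if remove_sequencing_duplicates:
        args.append("--remove-sequencing-duplicates")
    return args
```

The key point is that a false value is expressed by leaving the flag out entirely, never by passing it explicitly.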

adthrasher commented 5 months ago

Is there no MD5 created by the Spark version?

Nope. No idea why they dropped that. It produces an index, which is presumably single-threaded. Computing an MD5 shouldn't be that hard.
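If a checksum is still wanted, it can be computed as a separate step after the tool finishes. The sketch below is one possible way to do that and is not part of this PR; the function name is hypothetical, and the choice to write only the bare hex digest to a sidecar file named after the BAM with .md5 appended is an assumption.

```python
import hashlib
from pathlib import Path


def write_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 of a file and write it to `<path>.md5` (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large BAMs are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    hexdigest = digest.hexdigest()
    # Assumed sidecar format: the hex digest only, no filename or newline.
    Path(path + ".md5").write_text(hexdigest)
    return hexdigest
```

Reading in chunks keeps memory use flat regardless of file size, at the cost of one extra sequential pass over the output.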