harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

update read group naming in common.smk #124

Closed erikenbody closed 9 months ago

erikenbody commented 9 months ago

At the moment, the read group string is set from the "run" column. As a result, the same library sequenced on multiple flow cells is treated as separate libraries when duplicates are marked, even though duplicate marking happens after the reads are merged. Correct usage of the read group would set a library string, so that the library value in the RG string is the same for one library sequenced across multiple lanes. This should lead GATK and Sentieon to both mark duplicates according to the LibraryName column, while retaining the utility of the run column for processing each run separately before BAM merging.
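A rough sketch of what this could look like, not the actual code in common.smk: the helper name `build_read_group`, the `BioSample` column name, and the `ILLUMINA` platform value are assumptions for illustration. Only the "run" and LibraryName columns come from the description above. The `ID`, `SM`, `LB`, and `PL` tags are the standard `@RG` fields from the SAM specification.

```python
def build_read_group(run, sample, library, platform="ILLUMINA"):
    """Build a tab-separated @RG header line (hypothetical helper).

    ID stays unique per run so each run can still be processed
    separately before merging; LB carries the library name so that
    reads from one library share a value across runs.
    """
    fields = [
        "@RG",
        f"ID:{run}",        # per-run identifier, from the "run" column
        f"SM:{sample}",     # sample name (column name is an assumption)
        f"LB:{library}",    # library name, from the LibraryName column
        f"PL:{platform}",   # platform value is an assumption
    ]
    return "\t".join(fields)


# Two runs of the same library: the IDs differ, the LB value is shared.
rows = [
    {"Run": "run1", "BioSample": "sampleA", "LibraryName": "libA"},
    {"Run": "run2", "BioSample": "sampleA", "LibraryName": "libA"},
]
read_groups = [
    build_read_group(r["Run"], r["BioSample"], r["LibraryName"]) for r in rows
]
for rg in read_groups:
    print(rg)
```

With read groups built this way, the two runs keep distinct `ID` values but share `LB:libA`, which is the field that library-aware duplicate marking would be expected to use.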