samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

Representing optical/co-localized duplicates in SAM #121

Open tfenne opened 8 years ago

tfenne commented 8 years ago

@nh13, @jacarey, @jmarshall, @yfarjoun : I'm looking for opinions here. With the popularity of the HiSeq X and HiSeq 4000 which produce a lot of duplication on the sequencer that is geographically co-localized (i.e. pad-hopping duplication) I'm seeing more and more the desire to be able to identify reads that are optical duplicates above and beyond reads that are simply duplicates - i.e. a desire to partition duplicates into PCR duplicates and sequencer/optical duplicates.

I'm implementing some changes to allow Picard's MarkDuplicates to add this extra information, and wondered if anyone would support a change to the specification to standardize how it's stored? The options that I see are:

  1. A defined optional tag - we'd probably need to define a tag and value, since there isn't an optional field type for boolean or bit.
  2. Consume one of the flag fields

Thoughts? If this isn't supported in the spec I can always use a user-defined tag and value, but I suspect I won't be the only person who cares.

yfarjoun commented 8 years ago

This is very timely.

We are also working on projects that might benefit from this. Given the importance to a specific type/model of machine, though, I think that an optional tag is a better route than consuming a bit from the flag. We could even agree on some simple values for the optional tag (DuplicateType->DT?): 'L' for Library Duplicate and 'F' for Flowcell? I think it would be good to decide on the nomenclature here since functions like ROI will have assumptions.

tfenne commented 8 years ago

I'm open to either - but I should point out that all Illumina machines have produced optical duplicates, all the way back to the GAs, so it's not just one model of machine. And presumably other technologies might have similar problems (I don't know). I don't think it would be too much of a stretch to use a flag bit to discriminate between "library" and "sequencer" duplicates if described in a generic way.

That said, I'm happy to go with a tag if you and others think that's best @yfarjoun.

yfarjoun commented 8 years ago

Yes, this has been a problem in the past. However,

  1. Not enough of a problem to require this change (in the past...).
  2. It is not clear that the discrimination is useful to all users.

So my thinking is that until we know that most users would benefit from the knowledge, we spare the ones who are unconvinced of the benefits from the "cost". The main cost of the bits is that they are finite: we might have better uses for them in the future and not want to have incompatibility issues. This is mostly me being cautious rather than outright objecting to the bit option. If others feel that co-localized dups are here to stay and that they are worth a bit in everyone's SAM records, I'm not going to fight.

For the sake of having the discussion, I see several benefits of knowing whether the duplicate is library or co-localized:

  1. Enables better Library-size/ROI calculations (or at least better post-duplicate-marking analysis)
  2. Allows better QC when working with molecular barcodes (as co-localized reads are "definitely" duplicates, no matter what the barcodes say)

For actual analysis though, I'm not sure what the benefit of the distinction is, and so I can envision a pipeline that generates the DT tag during development and throws it away during production (which is my current "argument" for keeping it a tag....)

lh3 commented 8 years ago

I prefer a new tag because for the vast majority of applications, we don't care if the duplicate is optical or PCR. It seems too costly to consume an invaluable bit (we only have four left) for the few applications that want to distinguish the two types of duplicates.
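
The "four left" figure can be checked against the FLAG definition. The sketch below lists the twelve bits the SAM specification currently defines and counts what remains in a 16-bit FLAG field. The short bit names and the helper `unused_flag_bits` are illustrative only and are not taken from any library.

```python
# FLAG bits defined by the SAM specification (short names are informal).
DEFINED_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_template",
    0x80: "last_in_template",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

FLAG_WIDTH = 16  # FLAG is stored as an unsigned 16-bit integer


def unused_flag_bits():
    """Return the FLAG bit values that have no defined meaning."""
    all_bits = (1 << i for i in range(FLAG_WIDTH))
    return [bit for bit in all_bits if bit not in DEFINED_FLAG_BITS]


print([hex(bit) for bit in unused_flag_bits()])
```

Spending one of these on a duplicate subtype would leave three for every future use, which is the cost being weighed here.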

nh13 commented 8 years ago

I favor having an optional tag, so we can have multiple values for the tag (ex. pcr, optical, ...). I envision a mode (in production) where we measure the optical duplicate rate, and in case it is "high", we can re-compute and mark the optical duplicates for downstream diagnosis. This makes it an opt-in model, which consumes no more bytes unless we want to. I agree with @lh3 that we should be conservative about using one of the four remaining flag fields.

tfenne commented 8 years ago

Ok, sounds like we have consensus that it should be a new tag with some amount of flexibility. I'm going to start using DT for duplicate type as the extra attribute, and once #113 is merged I will put together a pull request to specify it in more detail.

dkj commented 8 years ago

Curious about the detail: @yfarjoun what tag values are you planning on? Given 4 templates A, B, C, D with alignments such that we think they are all duplicates, with A spatially adjacent to B in the same flowcell lane, C in the same flowcell lane but spatially separated from A and B by other templates, and D from a different flowcell lane (presuming DT should be marked only on those with the 0x400 duplicate bit set rather than on all templates in a set where duplicates have been found):

A 0
B 1 DT:optical
C 1 DT:flowcell
D 1 DT:library

and if D has slightly better quality values than A, than C, than B:

A 1 DT:library
B 1 DT:optical
C 1 DT:flowcell
D 0 

and if D has slightly better quality values than C, than A, than B:

A 1 DT:flowcell
B 1 DT:optical
C 1 DT:library
D 0 

Are you considering marking duplicates within a flowcell lane differently from those found across different lanes? Thinking of the Illumina ordered flowcells of HiSeq{X,{4,3}00} are the cluster generation duplicates manifested purely in adjacent pad hopping? (I don't know...).

tfenne commented 8 years ago

@dkj I'm not @yfarjoun but I'll give you an answer anyway. I think there are only two kinds of duplicates that we currently know/care to differentiate between:

  1. Those [likely] caused by amplification in library preparation
  2. Those that are likely caused by clonal amplification and sequencing

For a long time we've referred to 2 as "optical duplicates", but with the Illumina ordered flowcells we have realized that pad-hopping presents almost identically to optical duplicates, but from a totally different mechanism. As such I think we're just struggling to come up with a good name for class 2, since "optical" no longer works. I think @yfarjoun was suggesting flowcell as the replacement, but I don't like that as it is somewhat Illumina specific. My suggestion would be "Library"/"LB" and "Sequencing"/"SQ" as the two kinds.
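
One way to picture the two-class split is a rule that labels a duplicate `SQ` when it sits close to the retained read on the same tile, and `LB` otherwise. The sketch below is only an illustration under stated assumptions: it assumes Illumina-style read names of the form `instrument:run:flowcell:lane:tile:x:y`, and the function names and the `max_pixel_distance` default of 100 are hypothetical. It is not a description of how Picard or any other tool actually makes the call.

```python
from typing import NamedTuple


class ClusterLocation(NamedTuple):
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int


def parse_location(read_name):
    """Parse an assumed instrument:run:flowcell:lane:tile:x:y read name."""
    parts = read_name.split(":")
    if len(parts) < 7:
        raise ValueError(f"unexpected read name format: {read_name!r}")
    return ClusterLocation(
        parts[2], int(parts[3]), int(parts[4]), int(parts[5]), int(parts[6])
    )


def classify_duplicate(representative_name, duplicate_name, max_pixel_distance=100):
    """Return "SQ" for a co-localized duplicate, otherwise "LB".

    The distance threshold is a hypothetical value chosen for illustration.
    """
    rep = parse_location(representative_name)
    dup = parse_location(duplicate_name)
    same_tile = (rep.flowcell, rep.lane, rep.tile) == (dup.flowcell, dup.lane, dup.tile)
    close = (
        abs(rep.x - dup.x) <= max_pixel_distance
        and abs(rep.y - dup.y) <= max_pixel_distance
    )
    return "SQ" if same_tile and close else "LB"


print(classify_duplicate("M1:7:FC1:1:1101:1000:2000", "M1:7:FC1:1:1101:1050:2020"))
print(classify_duplicate("M1:7:FC1:1:1101:1000:2000", "M1:7:FC1:2:1101:1000:2000"))
```

Under this rule a duplicate from another lane or flowcell can only be `LB`, which matches the idea that class 2 covers duplication arising on the sequencer itself.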

yfarjoun commented 8 years ago

I'm in agreement with @tfenne. I thought that flowcell is general enough... but sequencing works too.

Y.

jmarshall commented 6 years ago

The SAMtags document (PR #113) has been split out for a while now (see @tfenne's comment). Is this DT still something that people want to have represented?

yfarjoun commented 6 years ago

We've put this to use in Picard and are populating DT on demand (with values LB and SQ for "Library" and "Sequencing" based duplication.)
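
For readers who want to consume such output, here is a minimal sketch that tallies duplicate records by their `DT` value from plain SAM text. It assumes `DT` is written as a string-typed optional field (`DT:Z:LB` or `DT:Z:SQ`), which this thread does not state explicitly. The function names and the sample records are made up for illustration.

```python
from collections import Counter

DUPLICATE_FLAG = 0x400  # SAM FLAG bit marking a duplicate


def duplicate_type(sam_line):
    """Return the DT value of a duplicate record.

    Returns "untyped" for a duplicate without a DT tag, and None for a
    record that is not flagged as a duplicate.
    """
    fields = sam_line.rstrip("\n").split("\t")
    if not int(fields[1]) & DUPLICATE_FLAG:
        return None
    for tag in fields[11:]:  # optional fields follow the 11 mandatory columns
        if tag.startswith("DT:Z:"):  # assumed encoding of the tag
            return tag[len("DT:Z:"):]
    return "untyped"


def tally_duplicate_types(sam_lines):
    """Count duplicate records per DT value, skipping header lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        dt = duplicate_type(line)
        if dt is not None:
            counts[dt] += 1
    return counts


# Made-up records: flag 99 is not a duplicate, 1123 is 99 + 0x400.
records = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "r2\t1123\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*\tDT:Z:SQ",
    "r3\t1123\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*\tDT:Z:LB",
    "r4\t1123\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
]
print(tally_duplicate_types(records))
```

Because the tag is populated on demand, a reader like this has to tolerate duplicates that carry no `DT` at all, hence the separate "untyped" bucket.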

jkbonfield commented 6 years ago

It sounds like people want it, in which case I have no objection. I'd go with codifying what's already being done. @tfenne have you got any code writing this already too? (I'm hoping not, unless by luck/design the two implementations match!)

tfenne commented 6 years ago

@jkbonfield I'm 99% sure I don't, and certainly not any that is anywhere but in a back pocket.

PedalheadPHX commented 5 years ago

@jkbonfield Assuming this is still an open issue, I just want to add a vote of interest in this. We have moved to using samtools markdup -S and would like to have an estimate of the optical/flowcell/sequencing duplicates as distinct from PCR/randomIdenticalFragment duplicates. With patterned flowcells in particular, this can help the people loading the flowcells learn whether their clustering input is optimal, as the percentage of active wells is not very helpful.
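
The estimate asked for here reduces to simple arithmetic once duplicates are split by type. The sketch below assumes per-type counts are already available. The function name, the result keys, and the example numbers are all hypothetical.

```python
def duplicate_rates(total_reads, library_dups, sequencing_dups):
    """Summarize duplication by type as fractions.

    All three inputs are read counts; total_reads includes non-duplicates.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    all_dups = library_dups + sequencing_dups
    return {
        "duplicate_rate": all_dups / total_reads,
        "library_rate": library_dups / total_reads,
        "sequencing_rate": sequencing_dups / total_reads,
        # Share of duplicates attributable to the sequencer rather than the library.
        "sequencing_share_of_duplicates": sequencing_dups / all_dups if all_dups else 0.0,
    }


# Made-up counts for illustration only.
rates = duplicate_rates(total_reads=1_000_000, library_dups=80_000, sequencing_dups=20_000)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

A rising sequencing share at a steady library rate would point at loading or clustering rather than library preparation, which is the signal of interest for the people loading flowcells.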