samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
644 stars 174 forks source link

SAMtags does not include BD and BI #202

Open daviesrob opened 7 years ago

daviesrob commented 7 years ago

While looking at some CRAM files that had been through GATK realigner, I noticed that they included BD:Z and BI:Z aux tags. These tags do not appear in the SAMtags specification. This GATK forums thread suggests that there was some talk of adding them back in 2013, but it doesn't look like this ever happened.

Given the popularity of GATK, I would guess that a lot of files include them. There's a good case that they should be added to SAMtags so consumers of these files can understand what the tags mean.

nh13 commented 7 years ago

@daviesrob since both GATK developers rpoplin and carneiro state that they are still unsure if it is worth having the tags (for space reasons), I would instead recommend that GATK transition to using user-reserved tags: X?, Y?, or Z?. I would like to at least see other tools or the like using these tags so it's not tool-specific.

vdauwera commented 7 years ago

Hi @daviesrob, speaking on behalf of the GATK dev team, we've been moving away from using BD and BI (we no longer use them in our production pipelines). We have heard from some users that they're still helpful for some datatypes (IonTorrent I think) that have distinctive indel error modes, but I would expect that's probably not a majority concern. So I'm not convinced it's worth cluttering up the spec with them.

jkbonfield commented 7 years ago

I demonstrated a need for them in one of the other platforms (either IonTorrent or PacBio, but likely both). The reason is that when these values are not present GATK is using a flat prior that is not platform specific. It could be potentially, using the @RG PL field, but it doesn't appear to do so.

Therefore it is necessarily to set BD/BI on those instruments to avoid a huge indel overcall rate. It's possible however to set them to constant values and gain most of the benefit, without suffering the high costs of storing a lot of extra signal. See http://gatkforums.broadinstitute.org/gatk/discussion/9329/bd-bi-tag-usage/p1

If GATK gains options to override the priors on the command line and/or has machine specific priors gleaned from the platform headers, then these fields become unnecessary. I also fully support changing to user-defined tags. But this hasn't happened yet (I believe it was 2013 when this issue was first raised) and we now have a large corpus of existing data with these tags in use which pretty much means we cannot reassign those tags to any other use. Therefore it makes sense to formally document them, even if it is with a note to state they are deprecated.

nh13 commented 7 years ago

@jkbonfield I don't see a link for the demonstrated need, though I conceptually understand why there would be a need. Could you post it for posterity?

This link gives a way to do the recalibration without having to ever write a recalibrated BAM with the BI/BD tags present. I think that is much preferred when using GATK's variant caller. Who else besides the GATK variant caller uses the BI/BD tag?

jkbonfield commented 7 years ago

@nh13 Thanks for the link - that is indeed a useful way of avoiding needing these even for non-Illumina platforms.

However the fact remains we have a large corpus of data in the wild that uses these tags. Could we therefore ever use BI and BD for another purpose? I suspect not because it would simply yield too much confusion and errors due to their presence is existing data, regardless of whether there were better ways on offer.

Perhaps the correct solution therefore is to maintain a list of deprecated but reserved tags so we don't ever feel tempted to reuse them without first realising what issues may crop up. This means we wouldn't need to define them, just list them for posterity.

@vdauwera Does this mean we can expect these tags to be changed to X,Y, Z tags soon in GATK, or perhaps clearer to lowercase "bd:Z:" and "bi:Z:" instead? Are there other non-official tags in use by GATK too?

daviesrob commented 7 years ago

@nh13, @vdauwera The problem here is I'm currently looking at 24709 examples of the ship having sailed long ago. The usage of these tags needs to be documented, if only for historical interest.

vdauwera commented 7 years ago

I'm certainly sympathetic to the need to document the tags and would be happy to provide appropriate documentation.

We may be able to change the tags we emit in the upcoming GATK4 version -- tagging @droazen to chime in on behalf of the GATK4 engine team.

tfenne commented 7 years ago

@daviesrob I'm totally sympathetic to your plight having been there myself many a time. I do think though, that we need to be careful about situations like this. If a vendor creates software that produces BAMs with new "reserved" tags and propogates thousands of BAMs in the wild, will we feel the same? So long as we're willing to apply the same rule to non-Sanger/Broad folks I don't see any reason to object.

FWIW I think this does highlight a real issue with how the tags portion of the spec evolves. I'm going to open up an issue to have a more general discussion on that.

colinhercus commented 7 years ago

Hi,

My 2 cents..

Aside from changing the specs perhaps Picard ValidateSAM and samtools view could give a warning if any undocumented reserved tags are used. This might help reduce future occurrences of this.

It might also be useful to have a central TAG repository where users could "reserve" and document their "user" tags. There are over 2300 possible user defined tag codes which should be enough to avoid conflict.

Cheers, Colin

On 20 April 2017 at 03:58, Tim Fennell notifications@github.com wrote:

@daviesrob https://github.com/daviesrob I'm totally sympathetic to your plight having been there myself many a time. I do think though, that we need to be careful about situations like this. If a vendor creates software that produces BAMs with new "reserved" tags and propogates thousands of BAMs in the wild, will we feel the same? So long as we're willing to apply the same rule to non-Sanger/Broad folks I don't see any reason to object.

FWIW I think this does highlight a real issue with how the tags portion of the spec evolves. I'm going to open up an issue to have a more general discussion on that.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/samtools/hts-specs/issues/202#issuecomment-295414884, or mute the thread https://github.com/notifications/unsubscribe-auth/ABcc-n-CzH2lyJj3Z_pvf8Y3raTgTJvNks5rxmdwgaJpZM4NBvnH .

jkbonfield commented 7 years ago

Agreed. Although it may be wishful thinking - not all user tags for bwa for example are in the man page, so expecting them to be added to some user-tag-wiki (for example) is perhaps hopeful! Still, it's better than nothing and may help spot common usage of tags (XA, X0, etc) that deserve migration to officialdom.

daviesrob commented 7 years ago

@tfenne Yes, it's not ideal. We should be more proactive on reporting bugs against tools that abuse the tag specification in this way. The reason for an exception in this case is just due to the likelihood of these tags being in lots of files in public archives already.

I also agree that updating the tag specification is far too slow at the moment. The apparently endless debate over #109, #119 and #200 shows that the current process really isn't working.

@colinhercus We could check for tag validity, but it would need a machine-readable list of valid tags in a well-known location.

colinhercus commented 7 years ago

For a machine readable file how about a C .h file in htslib? I would only issue warning if a tag was not in the list.

On 20 April 2017 at 19:51, daviesrob notifications@github.com wrote:

@tfenne https://github.com/tfenne Yes, it's not ideal. We should be more proactive on reporting bugs against tools that abuse the tag specification in this way. The reason for an exception in this case is just due to the likelihood of these tags being in lots of files in public archives already.

I also agree that updating the tag specification is far too slow at the moment. The apparently endless debate over #109 https://github.com/samtools/hts-specs/issues/109, #119 https://github.com/samtools/hts-specs/pull/119 and #200 https://github.com/samtools/hts-specs/pull/200 shows that the current process really isn't working.

@colinhercus https://github.com/colinhercus We could check for tag validity, but it would need a machine-readable list of valid tags in a well-known location.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/samtools/hts-specs/issues/202#issuecomment-295703209, or mute the thread https://github.com/notifications/unsubscribe-auth/ABcc-m39xTNNtLEuZqzz40cpNvgwy88qks5rx0argaJpZM4NBvnH .

daviesrob commented 7 years ago

We could embed a list of known tags in htslib, but it would slowly go out of date. It really needs the list to be in a downloadable file so that it can validate tags that have appeared since a particular copy of htslib was installed. The tag list could be cached, so it would only be necessary to check the remote copy if an unknown tag appears in the file being validated.

If we do have a machine-readable file, I would think somewhere in hts-specs would be the obvious place to keep it. That would make it easier to keep it in line with SAMtags.tex.