samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

should `U` be considered a "valid read base"? #1478

Closed yfarjoun closed 4 years ago

yfarjoun commented 4 years ago

While the SamSpec allows any character matching the regex `\*|[A-Za-z=.]+`, htsjdk considers only a subset of those to be valid, namely the IUPAC characters.

The current implementation leaves out 'U', which may be produced to indicate uracil as opposed to thymine.

Should we add "U" to the list of valid IUPAC bases in htsjdk?
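
To make the gap concrete, here is a minimal standalone sketch contrasting the two checks. It does not call htsjdk: the class `ReadBaseCheck`, its method names, and the base list `IUPAC_NO_U` are hypothetical stand-ins for illustration, and the real validator's accepted set may differ (for example in how it treats `.` or `=`).

```java
import java.util.regex.Pattern;

public class ReadBaseCheck {
    // SEQ pattern as quoted above: a lone "*" or one or more of [A-Za-z=.]
    private static final Pattern SAM_SPEC_SEQ = Pattern.compile("\\*|[A-Za-z=.]+");

    // Hypothetical IUPAC-only alphabet that omits 'U' (an assumption, not htsjdk's actual list)
    private static final String IUPAC_NO_U = "ACGTRYSWKMBDHVN";

    // True if the whole string is allowed by the spec regex
    static boolean matchesSamSpec(String bases) {
        return SAM_SPEC_SEQ.matcher(bases).matches();
    }

    // True if every base (case-insensitive) is in the U-less IUPAC alphabet
    static boolean isIupacWithoutU(String bases) {
        if (bases.isEmpty()) {
            return false;
        }
        for (char c : bases.toCharArray()) {
            if (IUPAC_NO_U.indexOf(Character.toUpperCase(c)) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "ACGU" passes the spec regex but fails the U-less IUPAC check
        for (String s : new String[] {"ACGT", "ACGU", "NRYK", "AC-T"}) {
            System.out.println(s + " spec=" + matchesSamSpec(s)
                    + " iupacNoU=" + isIupacWithoutU(s));
        }
    }
}
```

Under these assumptions, adding `U` to the alphabet string would be enough to make the two checks agree on `ACGU`.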

lbergelson commented 4 years ago

Do people write U's in RNA BAMs, or do they convert them to T and just know what they are from context?

yfarjoun commented 4 years ago

No idea! I was writing tests for a different PR (https://github.com/broadinstitute/picard/pull/1506) and tried adding "all" the IUPAC bases as test cases. Since U is technically an IUPAC base, I tried adding it, and the htsjdk validator exploded...

yfarjoun commented 4 years ago

Hmmm. I think that, for now, new technologies should use a tag (perhaps MM or similar, from https://github.com/samtools/hts-specs/pull/418) instead of this...