samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

CRAM: is it ok that SamTag doesn't return NM and MD tag for SamRecord? #1350

Closed PolinaBevad closed 5 years ago

PolinaBevad commented 5 years ago

Hi all! I know that CRAM file doesn't store NM and MD tags because they can be calculated from reference and read sequence. But what about SAMRecord class? I create SamReader through SamReaderFactory and use it to get SAMRecord, but when I try to get NM tag from SAMRecord for read of CRAM file, it returns null. Also, SAMRecord.getSAMString() returns it without NM and MD tags.

Is there any other way to get NM tag except manually invoking calculateSamNmTag() from SequenceUtil? I thought that SAMRecord will be identical to samtools behavior (samtools shows NM and MD tags for CRAM). Also I read a discussion about BAM-CRAM-BAM conversion and it seems that it must be lossless (https://github.com/samtools/htsjdk/issues/483).

Thank you! Environment: htsjdk 1.19.0, openjdk java 1.8.0.66

cmnbroad commented 5 years ago

@PolinaBevad Whether or not the NM/MD tags are returned through SAMRecord depends on whether they were serialized out in the CRAM you're reading. That in turn depends on the policy of the CRAM writer used to create the CRAM.

Older versions of the htsjdk CRAM writer didn't serialize these tags on write, but more recent versions do if they're present in the SAMRecord at the time its written to CRAM. So I would expect that if you wrote a CRAM using a recent htsjdk, from a BAM that contains NM/MD tags, they would round trip and you should get them back on read. But if they're not present in the CRAM, you'd have to regenerate them.

PolinaBevad commented 5 years ago

Thank you, everything is clear now.