samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
279 stars 244 forks source link

CG tag has the incorrect type. #1560

Open jkbonfield opened 3 years ago

jkbonfield commented 3 years ago

Description of the issue:

Using picard SamFormatConverter to turn a long-read CRAM to BAM with more than 65536 cigar ops generates a CG:B:i, tag. The SAM specification dictates this must be CG:B:I,. This may be overly picky, but samtools validates against and thus rejects this data. Obviously we can make Samtools more lenient, but it would also be good to correct the issue at source as it is generating malformed data that will cause issues for older copies of samtools.

Your environment:

Whatever version of htsjdk is bundled in Picard 2.25.7. (I'm assuming this issue is part of htsjdk and not picard.) openjdk 11.0.11 Ubuntu 18.04.5 LTS (bionic)

Steps to reproduce

See the file reported in https://github.com/samtools/samtools/issues/1477

If you convert this to SAM or CRAM (not using samtools so the CIGAR is correct) and then convert to BAM using picard then we see the uncompressed BAM binary contains CGBi.

PS. It would also be good to get PG headers added. I only tried Picard as a punt, as there's nothing in the headers to indicate that is infact what did the format conversion. Please provide data provenance for format conversions.

lbergelson commented 3 years ago

@jkbonfield Thanks for reporting. Looks like an oversight where we forgot to set the "isUnsigned" bit when setting the tag value.

Adding provenance to format conversions is a good idea, but I'll have to think about where that could live. Currently the writer in htsjdk doesn't know the type that the reader read since it just gets the same SAMRecord object no matter what the input was. Is that @cmnbroad still the case in the new code? I assume it is for BAM/CRAM since we don't have different header versions that would be propagated.

Does htslib automatically add PG lines when performing conversions or is it something special in samtools view?