Open jkbonfield opened 3 years ago
@jkbonfield Thanks for reporting. Looks like an oversight where we forgot to set the "isUnsigned" bit when setting the tag value.
Adding provenance to format conversions is a good idea, but I'll have to think about where that could live. Currently the writer in htsjdk doesn't know the type that the reader read since it just gets the same SAMRecord object no matter what the input was. Is that @cmnbroad still the case in the new code? I assume it is for BAM/CRAM since we don't have different header versions that would be propagated.
Does htslib automatically add PG lines when performing conversions or is it something special in samtools view?
Description of the issue:
Using picard
SamFormatConverter
to turn a long-read CRAM to BAM with more than 65536 cigar ops generates aCG:B:i,
tag. The SAM specification dictates this must beCG:B:I,
. This may be overly picky, but samtools validates against and thus rejects this data. Obviously we can make Samtools more lenient, but it would also be good to correct the issue at source as it is generating malformed data that will cause issues for older copies of samtools.Your environment:
Whatever version of htsjdk is bundled in Picard 2.25.7. (I'm assuming this issue is part of htsjdk and not picard.) openjdk 11.0.11 Ubuntu 18.04.5 LTS (bionic)
Steps to reproduce
See the file reported in https://github.com/samtools/samtools/issues/1477
If you convert this to SAM or CRAM (not using samtools so the CIGAR is correct) and then convert to BAM using picard then we see the uncompressed BAM binary contains
CGBi
.PS. It would also be good to get
PG
headers added. I only tried Picard as a punt, as there's nothing in the headers to indicate that is infact what did the format conversion. Please provide data provenance for format conversions.