samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

HTSJDK should not accept non-two-character header tags #1477

Closed jmarshall closed 3 years ago

jmarshall commented 4 years ago

Description of the issue:

HTSJDK does not produce a diagnostic when reading header tags that are not two characters in length. For example,

@SQ SN:one  LN:1000
@RG ID:A18  PL:Illumina LNID:5  FCID:HHH7NBBXX  DT:2017-01-10T00:00:00+0100 BCID:GTCCGC SM:A18-germline
r1  0   one 50  20  4M  *   0   0   ATGC    qqqq    RG:Z:A18    NM:i:0

In this file, LNID:5, FCID:…, and BCID:… on the @RG line do not match the pattern required for a SAM header line, as there are not exactly two characters before the colon.

Steps to reproduce

I would expect this to be diagnosed by e.g.

picard ValidateSamFile I=badtag.sam

or additionally by ViewSam or any other SAM file reading.

Expected behaviour

ValidateSamFile error message to the effect of

ERROR   2020-05-04 20:08:28 ValidateSamFile Header tag does not have length() == 2: LNID

(This is edited from the error message that does occur for an alignment record with e.g. LNID:i:5 as an invalidly-tagged aux field.)

Actual behaviour

WARNING 2020-05-04 20:13:53 ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
No errors found
jmarshall commented 4 years ago

Example file badtag.sam comes from samtools/samtools#1237 — unusually, HTSlib has been stricter here than HTSJDK!

lbergelson commented 4 years ago

I sort of assumed it would have complained if it saw something like that, thanks for pointing this out.

jmarshall commented 4 years ago

I too assumed SAMTextHeaderCodec.java's ParsedHeaderLine would have always complained about this, but it seems not. I guess with neither major implementation (prior to htslib 1.10 paying attention to headers) noticing this problem, this is how the samtools issue OP's professor's BAM file has been like this since 2017. Sigh…