Open jkbonfield opened 8 years ago
@vadimzalunin Can you take a look. IIRC this may be the result of the multiref PR ?
qscore 255 (-1) is used internally for lossy models to mask discarded positions so it may interfere, I need to check.
The reason for not implementing Huffman<byte[]> is that it seemed impractical but now this looks like a good use case.
Subject of the issue
In an attempt to produce a cram_filter tool to drop specific columns, I seem to have hit an issue in the htsjdk decoding which I am unsure about.
Specifically when I drop the quality string, I replace it with a HUFFMAN codec with 1 value (fixed quality, eg 40) and 1 bit-length (zero). Ideally I'd use quality 255 so the quality comes back at "*", but for some reason that's not working which I'm still getting to grips with. However q40 should be sufficient.
Such a file decodes fine with samtools and scramble, but complains with cramtools (see below).
The error here is that there is no BYTE_ARRAY type for the HUFFMAN decoder. I cannot see anything in the CRAM specification that indicates which input types (integer, char, array, etc) are permitted for each codec type, so I assume all are permitted and we simply implement the combinations that (pragmatically) we've needed to date.
Your environment
Steps to reproduce
Decode the attached cram file. a.cram.zip
Expected behaviour
I should be seeing quality strings full of "I" (Q40), assuming that decoding an array of items using HUFFMAN is permitted. (It's definitely not clear from the spec, so maybe I'm simply out of the spec here.)
Unfortunately doing this any other way is very hard. To completely remove qualities, as we could do with lossy quality mode, requires changing the CF data series for every read to rewrite one of the bit flags. This is substantially more work than simply munging the compression header and skipping a block.
Actual behaviour