samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
279 stars 244 forks source link

CRAM: types for HUFFMAN codecs #518

Open jkbonfield opened 8 years ago

jkbonfield commented 8 years ago

Subject of the issue

In an attempt to produce a cram_filter tool to drop specific columns, I seem to have hit an issue in the htsjdk decoding which I am unsure about.

Specifically when I drop the quality string, I replace it with a HUFFMAN codec with 1 value (fixed quality, eg 40) and 1 bit-length (zero). Ideally I'd use quality 255 so the quality comes back at "*", but for some reason that's not working which I'm still getting to grips with. However q40 should be sufficient.

Such a file decodes fine with samtools and scramble, but complains with cramtools (see below).

The error here is that there is no BYTE_ARRAY type for the HUFFMAN decoder. I cannot see anything in the CRAM specification that indicates which input types (integer, char, array, etc) are permitted for each codec type, so I assume all are permitted and we simply implement the combinations that (pragmatically) we've needed to date.

Your environment

Decode the attached cram file. a.cram.zip

Expected behaviour

I should be seeing quality strings full of "I" (Q40), assuming that decoding an array of items using HUFFMAN is permitted. (It's definitely not clear from the spec, so maybe I'm simply out of the spec here.)

Unfortunately doing this any other way is very hard. To completely remove qualities, as we could do with lossy quality mode, requires changing the CF data series for every read to rewrite one of the bit flags. This is substantially more work than simply munging the compression header and skipping a block.

Actual behaviour

@ seq3a[scratch01/jkb]; /software/jdk1.8.0_74/bin/java -Xmx4g -jar ~/work/cram/cramtools-3.0.jar bam -I a.cram -R hg19/all.fa --skip-md5-check
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at net.sf.cram.CramTools.invoke(CramTools.java:91)
    at net.sf.cram.CramTools.main(CramTools.java:121)
Caused by: java.lang.RuntimeException: Encoding not found for value type BYTE_ARRAY, id=HUFFMAN
    at htsjdk.samtools.cram.encoding.reader.DataReaderFactory.createReader(DataReaderFactory.java:97)
    at htsjdk.samtools.cram.encoding.reader.DataReaderFactory.buildReader(DataReaderFactory.java:62)
    at htsjdk.samtools.cram.build.ContainerParser.getRecords(ContainerParser.java:101)
    at htsjdk.samtools.cram.build.ContainerParser.getRecords(ContainerParser.java:155)
    at htsjdk.samtools.cram.build.ContainerParser.getRecords(ContainerParser.java:60)
    at net.sf.cram.Cram2Bam.main(Cram2Bam.java:206)
    ... 6 more
cmnbroad commented 8 years ago

@vadimzalunin Can you take a look. IIRC this may be the result of the multiref PR ?

vadimzalunin commented 8 years ago

qscore 255 (-1) is used internally for lossy models to mask discarded positions so it may interfere, I need to check.

The reason for not implementing Huffman<byte[]> is that it seemed impractical but now this looks like a good use case.