samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

Add flag to open CRAM file without reference sequence #1570

Closed jdotsoft closed 2 years ago

jdotsoft commented 3 years ago

There are cases when CRAM file should be open without dependency on reference sequence. See example http://www.jdotsoft.com/CovReport2.php

The default HTSJDK implementation throws if reference sequence is not provided:

java.lang.IllegalArgumentException: A reference must be supplied that includes the reference sequence for chr1.
  at htsjdk.samtools.cram.ref.CRAMLazyReferenceSource.getReferenceBases(CRAMLazyReferenceSource.java:41)
  at htsjdk.samtools.cram.build.CRAMReferenceRegion.getReferenceBases(CRAMReferenceRegion.java:74)
  at htsjdk.samtools.cram.structure.Slice.normalizeCRAMRecords(Slice.java:450)
  at htsjdk.samtools.cram.structure.Container.getSAMRecords(Container.java:322)
  at htsjdk.samtools.CRAMIterator.nextContainer(CRAMIterator.java:112)
  at htsjdk.samtools.CRAMIterator.hasNext(CRAMIterator.java:204)
  at htsjdk.samtools.SamReader$AssertingIterator.hasNext(SamReader.java:608)

I implemented ugly workaround to bypass the required reference sequence in htsjdk.samtools.cram.structure.Container class:

    public List<SAMRecord> getSAMRecords(
            final ValidationStringency validationStringency,
            final CRAMReferenceRegion cramReferenceRegion,
            final CompressorCache compressorCache,
            final SAMFileHeader samFileHeader) {
        final List<SAMRecord> samRecords = new ArrayList<>(getContainerHeader().getNumberOfRecords());
        for (final Slice slice : getSlices()) {
            final List<CRAMCompressionRecord> cramCompressionRecords = slice.deserializeCRAMRecords(compressorCache, validationStringency);
            // before we convert to SAMRecord, we need to normalize the CRAMCompressionRecord in each Slice
            // [JS] commented out next three lines:
            //slice.normalizeCRAMRecords(
            //        cramCompressionRecords,
            //        cramReferenceRegion);
            for (final CRAMCompressionRecord cramCompressionRecord : cramCompressionRecords) {
                cramCompressionRecord.setIsNormalized(); // [JS] added
                final SAMRecord samRecord = cramCompressionRecord.toSAMRecord(samFileHeader);
                samRecord.setValidationStringency(validationStringency);
                samRecords.add(samRecord);
            }
        }
        return samRecords;
    }

There was a similar closed issue #1407.

Please consider adding option flag, e.g. SamReaderFactory.Options.EXCLUDE_REFERENCE_SEQUENCE with default option false to open CRAMFileReader without reference sequence.

Thanks Mark

cmnbroad commented 3 years ago

@jdotsoft There are certainly cases where a reference is not required, and the htsjdk CRAM stack already supports this via the default lazy reference (which you appear to already be using). It throws only if you try to do something that actually DOES require the reference. As you've discovered, Container.getSAMRecords is one such method; it always requires the actual reference unless the underlying CRAM has an embedded reference.

The SAMRecords you're obtaining using your workaround (bypassing normalization) are incomplete, and don't accurately reflect the underlying CRAM. Its certainly possible to use htsjdk to write custom code that cherry-picks the data you want, but it requires a pretty solid understanding of the API, and the CRAM format.

I'd be very reticent to add a general flag that allows partially formed SAMRecords to be accessed through SamReader like this though, if thats what you're requesting.

jdotsoft commented 3 years ago

Could you provide code example to scan CRAM records and avoid calling Container.getSAMRecords ?

I'd like to use HTSJDK out-of-the-box without any customization. I did my update to HTSJDK to implement the code below and to bypass requirement to have reference sequence. I wish I have a better and cleaner solution.

My current implementation is the following:

    SamReader samReader = SamReaderFactory.makeDefault().open(file);
    for (SAMRecord rec : samReader) {
      // processing BAM or CRAM record (no sequence required)
    }

Unfortunately iterator calls Container.getSAMRecords ...

Thanks Mark

cmnbroad commented 2 years ago

Closing as won't fix per discussion above.