samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
285 stars 243 forks source link

LinearIndex cannot be obtained due to method privacy #975

Open fnothaft opened 7 years ago

fnothaft commented 7 years ago

The LinearIndex class is public, but there is no public method that allows one to get a LinearIndex. The only way to get a LinearIndex is through getting a BAMIndexContent, which you can only get from calling the protected method AbstractBAMFileIndex.getQueryResults. This hampers how the linear BAM index can be used in downstream projects (e.g., https://github.com/HadoopGenomics/Hadoop-BAM/pull/140)

yfarjoun commented 7 years ago

Can't you use the BamIndexer with an OutputStream to write the index to where-ever you need? Why do you need the Linear Index object itself?

On Thu, Aug 31, 2017 at 10:55 AM, Frank Austin Nothaft < notifications@github.com> wrote:

The LinearIndex class is public https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/samtools/LinearIndex.java, but there is no public method that allows one to get a LinearIndex. The only way to get a LinearIndex is through getting a BAMIndexContent, which you can only get from calling the protected method AbstractBAMFileIndex. getQueryResults https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/samtools/AbstractBAMFileIndex.java#L358. This hampers how the linear BAM index can be used in downstream projects (e.g., HadoopGenomics/Hadoop-BAM#140 https://github.com/HadoopGenomics/Hadoop-BAM/pull/140)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/samtools/htsjdk/issues/975, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0q4ngs6LU1frT2XZCWHTihzOfOGJks5sdmbWgaJpZM4PIdHq .

fnothaft commented 7 years ago

In https://github.com/HadoopGenomics/Hadoop-BAM/pull/140, we add support for using a .bai to split a BAM file into (roughly) equivalent sized chunks for parallel read. When we split the BAM file up, we want to choose locations where we know that a valid BGZF block starts, else we have to seek around and try to guess where a BGZF block is. If we have the linear index, this process is trivial.

fnothaft commented 7 years ago

There are also other worthwhile uses for the linear index: http://www.biorxiv.org/content/early/2017/06/09/148296

yfarjoun commented 7 years ago

The "other worthwhile uses" (i.e. counting the number of records in each contig) can be implemented using sam.indexing().getIndex().getMetaData()

yfarjoun commented 7 years ago

For your purpose isn't BAMFileSpan the object you should be using?

yfarjoun commented 7 years ago

meaning, sam.indexing().getIndex().getSpanOverlapping()-> BAMFileSpan

fnothaft commented 7 years ago

The biorxiv posting I linked to uses the linear index to do finer grained coverage estimation than that (specifically, it uses the size of each block in the linear index to roughly approximate the coverage of that region; yes, an imperfect approximation, but a useful one for quick estimations).

For your purpose isn't BAMFileSpan the object you should be using? meaning, `sam.indexing().getIndex().getSpanOverlapping()-> BAMFileSpan

Ah, alas, getSpanOverlapping requires knowledge of the coordinate range that you're looking for. We're looking for a start/end offset (in bytes) into a file, and we then want to find the positions of the first BGZF block after the start/end offsets.

yfarjoun commented 7 years ago
  1. open bam header, get dictionary
  2. sam.indexing().getIndex().getSpanOverlapping() for all the contigs in the dictionary (unclear to me at this point, how does one get the FileSpans containing the unmapped records at the end?)
  3. use FileSpans to get a list of the Chunk start/stop positions in the file
  4. use in Hadoop...

Am I missing something?

fnothaft commented 7 years ago

I see. I mean, I suppose that would work as well, it's just a bit clunkier than having direct access to the LinearIndex, which is ever so slightly easier to traverse. Is there a reason that htsjdk doesn't want to allow public access to the LinearIndex? The LinearIndex class itself is already public.

yfarjoun commented 7 years ago

For that you'd have to ask the maintainers... @tfenne, @jacarey, @lbergelson ?

But, I suspect that an issue could be that of encapsulation, namely that you shouldn't need to know which type of index was used, etc. The method described above might also work for a CRAM for example...

fnothaft commented 7 years ago

But, I suspect that an issue could be that of encapsulation, namely that you shouldn't need to know which type of index was used, etc. The method described above might also work for a CRAM for example...

Ah, that'd be useful if true... In any case, we've got a workaround for now, so I'll hang around for an answer from the maintainers.