Open fnothaft opened 7 years ago
Can't you use the BamIndexer with an OutputStream to write the index to where-ever you need? Why do you need the Linear Index object itself?
On Thu, Aug 31, 2017 at 10:55 AM, Frank Austin Nothaft < notifications@github.com> wrote:
The LinearIndex class is public https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/samtools/LinearIndex.java, but there is no public method that allows one to get a LinearIndex. The only way to get a LinearIndex is through getting a BAMIndexContent, which you can only get from calling the protected method AbstractBAMFileIndex. getQueryResults https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/samtools/AbstractBAMFileIndex.java#L358. This hampers how the linear BAM index can be used in downstream projects (e.g., HadoopGenomics/Hadoop-BAM#140 https://github.com/HadoopGenomics/Hadoop-BAM/pull/140)
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/samtools/htsjdk/issues/975, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0q4ngs6LU1frT2XZCWHTihzOfOGJks5sdmbWgaJpZM4PIdHq .
In https://github.com/HadoopGenomics/Hadoop-BAM/pull/140, we add support for using a .bai to split a BAM file into (roughly) equivalent sized chunks for parallel read. When we split the BAM file up, we want to choose locations where we know that a valid BGZF block starts, else we have to seek around and try to guess where a BGZF block is. If we have the linear index, this process is trivial.
There are also other worthwhile uses for the linear index: http://www.biorxiv.org/content/early/2017/06/09/148296
The "other worthwhile uses" (i.e. counting the number of records in each contig) can be implemented using sam.indexing().getIndex().getMetaData()
For your purpose isn't BAMFileSpan the object you should be using?
meaning, sam.indexing().getIndex().getSpanOverlapping()-> BAMFileSpan
The biorxiv posting I linked to uses the linear index to do finer grained coverage estimation than that (specifically, it uses the size of each block in the linear index to roughly approximate the coverage of that region; yes, an imperfect approximation, but a useful one for quick estimations).
For your purpose isn't BAMFileSpan the object you should be using? meaning, `sam.indexing().getIndex().getSpanOverlapping()-> BAMFileSpan
Ah, alas, getSpanOverlapping
requires knowledge of the coordinate range that you're looking for. We're looking for a start/end offset (in bytes) into a file, and we then want to find the positions of the first BGZF block after the start/end offsets.
Am I missing something?
I see. I mean, I suppose that would work as well, it's just a bit clunkier than having direct access to the LinearIndex
, which is ever so slightly easier to traverse. Is there a reason that htsjdk doesn't want to allow public access to the LinearIndex
? The LinearIndex
class itself is already public.
For that you'd have to ask the maintainers... @tfenne, @jacarey, @lbergelson ?
But, I suspect that an issue could be that of encapsulation, namely that you shouldn't need to know which type of index was used, etc. The method described above might also work for a CRAM for example...
But, I suspect that an issue could be that of encapsulation, namely that you shouldn't need to know which type of index was used, etc. The method described above might also work for a CRAM for example...
Ah, that'd be useful if true... In any case, we've got a workaround for now, so I'll hang around for an answer from the maintainers.
The
LinearIndex
class is public, but there is no public method that allows one to get aLinearIndex
. The only way to get aLinearIndex
is through getting aBAMIndexContent
, which you can only get from calling the protected methodAbstractBAMFileIndex.getQueryResults
. This hampers how the linear BAM index can be used in downstream projects (e.g., https://github.com/HadoopGenomics/Hadoop-BAM/pull/140)