pezmaster31 / bamtools

C++ API & command-line toolkit for working with BAM data
MIT License
415 stars 152 forks source link

Speeding up BamMultiReader::Rewind()? #52

Open PyGuan opened 12 years ago

PyGuan commented 12 years ago

I need to get the list of all reads at a long list of genomic locations.

This requires me to call BamMultiReader::Rewind() multiple times. Hence I need the method to run as fast as possible.

Is there any way I can speed this up?

pezmaster31 commented 12 years ago

If you know the positions, there is no need to rewind manually. If you use the reader's SetRegion() method to define your desired range (or position), then looping over the region with GetNextAlignment() will limit the fetched alignments to only those overlap that range. Just a heads-up: random-access requires having an index file.

Something like the following (pseudo-C++) should work for looping over a list of regions and grabbing alignments:

BamMultiReader reader;
if ( !reader.Open(filenames) ) {
    // handle any "could not open file(s)" error
}

if ( !reader.LocateIndexes() ) {
    if ( !reader.CreateIndexes() ) {
        // handle any "could not find/create index file(s)" error 
    }
}

BamAlignment a;
foreach ( position, positions ) {
    int refId = reader.GetRefID(position.chromName);
    if ( refId < 0 )
        continue; // unknown chrom name 

    BamRegion region(refId, position.start, refId, position.stop);
    if ( !reader.SetRegion(region) )
        continue; // could not set region

    while ( reader.GetNextAlignment(a) ) {
        // do stuff with alignment 'a' 
    } 
}

Please let me know if you have any other questions (or if I've misunderstood your situation).

PyGuan commented 12 years ago

Hi Derek

Thank you for your prompt reply & the pseudo code.

I guess I didn't describe my situation clear enough. Let me give you an example:

If we have two regions [x,y) and [u,v) that are close enough to share some alignments.

Let's assume that: [x,y) contains alignments a1,a2,a3,a4 and [u,v) contains alignment a3,a4,a5,a6.

The alignments a3 and a4 are shared.

After populating all the alignments in [x,y), if I don't call "reader.Rewind", when I try to populate all alignments in [u,v) via "reader.GetNextAlignment", I will only get a5 and a6 for the region [u,v).

Is there any more efficient way to get a3, a4, a5 and a6 for [u,v) without calling "reader.Rewind"?

pezmaster31 commented 12 years ago

Sounds like my advice still fits your case, unless I'm really not following you.

Is there any more efficient way to get a3, a4, a5 and a6 for [u,v) without calling "reader.Rewind"?

Yes. SetRegion() is what you're looking for.

SetRegion() sets up an internal left/right bound within the reader and jumps to the first alignment in the BAM file which overlaps the left bound. Further calls to GetNextAlignment() will return true until you hit the first alignment with starting position >= y, at which point you've fetched all of the alignments overlapping the [x, y) region.

If you want to limit the alignments to only those with a start position within your region (excluding any that may begin before and overlap the left bound), you can add the following check:

reader.SetRegion(refId, x, refId, y);
while ( reader.GetNextAlignment(a) ) {
    if ( a.Position >= x ) {
        // then do something with alignment
    }
}

In the above example, you'll get any (and only those) alignments with a start position within [x, y).

ps - calling reader.Rewind() moves you all the way back to the beginning of the alignment records, so yeah that's really inefficient for range queries. It should rarely be needed unless you want to do full passes through the entire data set.

pezmaster31 commented 12 years ago

So, just to clarify, and apply to your example:

reader.SetRegion(refId, x, refId, y);
while ( reader.GetNextAlignment(a) ) {
    if ( a.Position >= x ) {
        // then do something with alignment
    }
}

reader.SetRegion(refId, u, refId, v);
while ( reader.GetNextAlignment(a) ) {
    if ( a.Position >= u ) {
        // then do something with alignment
    }
}

Should get you all the alignments starting within both of your intervals, regardless of whether the intervals overlap each other or not.

PyGuan commented 12 years ago

If you don't mind, could you please help to double confirm whether the current SetRegion "jumps to the first alignment in the BAM file which overlaps the left bound"?

Based on my observation and testing, it doesn't seem jumping. It seems that it will start from the GetNextAlignment pointer of [x,y) when retrieving the alignments for [u,v). Whatever alignments that were already populated for region [x,y) will not appear in [u,v).

Thank you.

pezmaster31 commented 12 years ago

Just to double-check - you're calling SetRegion(refId, u, refId, v) after getting all of the [x, y) alignments? But it doesn't actually move the file pointer?

PyGuan commented 12 years ago

Yes, I called SetRegion(refId, u, refId, v) after getting all of the [x, y) alignments. The pointer doesn't seem to move.

My codes are copied below for your reference:

BamMultiReader reader;
//Open the BAM file.
cout << "Opening BAM file...." << std::endl;
if(!reader.Open(bamFilenames)) {
    cerr << "Could not open input BAM files." << endl;
}
else {
    cout << "BAM file opened" << std::endl;
}

BamAlignment al;

const BamRegion currBr (currRefID, readPos1, currRefID, readPos1+1);
reader.SetRegion(currBr);
while (reader.GetNextAlignment(al)) {
    cout << al.Name << std::endl;
}

//reader.Rewind();  

const BamRegion currBr2 (currRefID, readPos2, currRefID, readPos2+1);
reader.SetRegion(currBr2);
while (reader.GetNextAlignment(al)) {
    cout << al.Name << std::endl;
}

reader.Close();
GuyDelMistro commented 8 years ago

I can't seem to get SetRegion() to work correctly.

Initially I was scanning many regions but reduced it down to 1 closed region (the first of 1 BAM file). The (coverage) for the reads (BamAlignment's) returned were different to that used by samtools depth (a few reads were omitted!). I also verified my code by applying my own region filter to the reads without using using SetRegion() and got exactly the same results as samtools.

Is there a bug with read filtering with SetRegions() or am I using it wrong? (I assumed it takes BED coordinates but I tried all +/-1 combos on regions start and end locations and this did not help.)

One thing that makes me believe the range filtering is almost certainly buggy is that when I decreased the region start position by 1 (w/o changing end position) GetNextAlignmentCore() returned fewer reads!

Range start = 43814968, number of overlapping alignments = 3523

Range start = 43814969, number of overlapping alignments = 3535

GuyDelMistro commented 8 years ago

Curiously my same code worked if I left all regions open at the right and called SetRegions() when are read start exceeded the last target position.

However, there appears to be a massive performance hit for calling SetRegion() again. With a single small BAM (876,499 reads) and 207 enriched target regions the run took 21 minutes!!! (This was using LocateIndexes() for a BAI file.)

In contrast, reading the whole BAM and direct alignment start/end filtering took only 5.0 seconds.

For comparison, samtools depth -l took 9.0 seconds, which is known to have serious performance issues. (Akin to samtools view with a BED file vs. using the region argument.)

Btw, using Rewind() is not an option here as I do not want alignments overlapping multiple regions in my BED file to be counted more than once.

GuyDelMistro commented 8 years ago

Discovered with a bit of work that I could use BamMultiReader::Jump() to implement an efficient working alternative to SetRegion(). Performance is way superior to samtools (depth) with BED file.

GuyDelMistro commented 8 years ago

There appears to be a serious issue using CreateIndexes() and then Jump() to use the new index file (at least for the same BamReader object).

When I tried this, after the second Jump() the next call to bamReader.GetNextAlignmentCore() returned 0. However, when I run the exact same code for same inputs using the BAI created (now this is detected using LocateIndexes()) the code works perfectly.

I tried using LocateIndexes() and Rewind() after CreateIndexes() but this did not help. However, using a different BamMultiReader object to do CreateIndexes() from that doing LocateIndexes() works fine.