walaj / SeqLib

C++ htslib/bwa-mem/fermi interface for interrogating sequence data
http://bioinformatics.oxfordjournals.org/content/early/2016/12/21/bioinformatics.btw741.full.pdf+html
Other
132 stars 36 forks source link

Paired end/mate mair reads #18

Open Ahhgust opened 7 years ago

Ahhgust commented 7 years ago

First I'd like to thank you for the time you've dedicated to this project. I really (for one) really appreciate it!

I have two questions than can be construed as "issues". First, it's a little unclear how one might use the BWAWrapper object to perform paired end read mapping. I see that we have access to the underlying bwa-mem routines, but I have to admit it's a little daunting. I don't suppose a routine could be added to BWAWrapper to illustrate how this might be done (this is not entirely a selfless question, I realize, but I do imagine that it would be of general use to the community)?

Second, it's not entirely clear (from the documentation) how and if any of the routines you have provided are thread safe, or if multithreading is happening behind the scenes (it appears that the underlying bwa-mem routines may be doing some sort of bag of tasks multithreading.). Obviously there is some degree of state that is being recorded (getters/setters), so the potential is certainly there for race conditions.

Thanks again, and I apologize if this is not the correct medium to ask this of you. -August Woerner

walaj commented 7 years ago

Hi August,

I'm glad you're finding this project useful, thank you! Your questions are good ones, and let me try to address what I can at the moment:

Currently, I am not using BWAs paired-end algorithms. From what I can tell from working through the code, BWA generates and alignment for each pair, and then has a function that takes the individual alignments and does some joint inference to refine. This has been on my to-do list for expanding the functionality of BWAWrapper, but hasn't been implemented yet. In principle, it wouldn't be too difficult.

There should be no hidden multithreading, except that perhaps the way BWA gets called is that it starts a single worker thread. Do you have some valgrind output that is suggesting that there are extra threads being started?

I try to make functions "const" wherever possible to suggest that they are thread-safe. An exception is the const function QueryRegion in RefGenome, which is not thread-safe because it calls a non-thread-safe function in htslib to retrieve sequence from a reference genome (I asked htslib folks). I agree that more systematically documenting these would be the best approach.

Are there specific aspects re: thread safety we should prioritize looking at?

Ahhgust commented 7 years ago

Hi Jeremiah Thanks for getting back to me so quickly! I guess my threading questions were rather simplistic. I plan on using seqlib, and the code base I have is already multithreaded-- I just wanted to ensure that, say, calling AlignSequence would be thread-safe. It also seems like if BWA already has multithreading implemented in it that it might be nice if that were exposed in your interface. I was also curious as to whose approach to multithreading is faster (mine or Heng's). As for the paired end mapping, the more I think about it I think it makes more sense for me to merge the paired end reads and then map the merged pair (but my case is a rather specialized one). I did, however, run into another issue. Valgrind is reporting a fair number of memory problems with the BamWriter class (a bunch of "use of uninitialised value of size ..." errors ).
Here's my g++ statement (64-bit Ubuntu 14.04): g++ -Wno-sign-compare -g -I/usr/local/src/SeqLib/SeqLib -I/usr/local/src/SeqLib/SeqLib/htslib quickie.cpp /usr/local/src/SeqLib /SeqLib/bin/libseqlib.a /usr/local/src/SeqLib/SeqLib/bin/libbwa.a /usr/local/src/SeqLib/SeqLib/bin/libhts.a -lpthread -lm -lz Here's the command: head -n 4 oog.fq | valgrind ./a.out /dev/stdin

And here's the source (quickie.cpp)

include

include

include "SeqLib/BamReader.h"

include "SeqLib/BamWriter.h"

include "SeqLib/FastqReader.h"

include "SeqLib/BWAWrapper.h"

include "SeqLib/FermiAssembler.h"

include "SeqLib/UnalignedSequence.h"

include "SeqLib/SeqPlot.h"

include "SeqLib/GenomicRegion.h"

using namespace SeqLib; using namespace std;

const string REFGENOME = "chrAll.standardChroms.fa";

int main(int argc, char **argv) {

if (argc < 2) { cerr << "I need a fastq file to work!\n" << endl; return 1; }

/ Mapping w/ BWA / BWAWrapper bwa;

bwa.LoadIndex(REFGENOME);

bwa.SetGapOpen(4); bwa.SetGapExtension(1); bwa.Set3primeClippingPenalty(1000); bwa.Set5primeClippingPenalty(1000);

BamRecordVector results; BamWriter bamWriter( SeqLib::BAM ); bamWriter.Open( "test.bam"); bamWriter.SetHeader( bwa.HeaderFromIndex() ); bamWriter.WriteHeader();

FastqReader q(argv[1]); UnalignedSequence s;

while (q.GetNextSequence(s) ) { bwa.AlignSequence(s.Seq, s.Name, results, false, 0.9, 0); if (! results.empty() ) { if (!bamWriter.WriteRecord( results.at(0) )) { cerr << "Failed to write record!" << endl; } break; }

}

return 0; }

And here's oog.fq:

@M01451:39:000000000-AGNHG:1:1101:18315:2050 1:N:0:25 ATTGATCTATCTGTCTGTCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATCCATCCATCCTATGTATT

Thanks again! -August

walaj commented 7 years ago

Hi August,

BWA scoops up many sequences at once, and then loops through the alignment function (presumably to minimize number of fetches from a file). AlignSequence from BWAWrapper jumps into BWA at the aligner part, side stepping BWA's gathering of sequences. I wouldn't think there would be a big difference between looping AlignSequence in SeqLib and the loop BWA does. Regarding multithreading, BWA just grabs more sequences at once, and then distributes to different threads. Since SeqLib uses BWA after this step, there is no use of BWAs multithreading in SeqLib. I think this is simpler, and the SeqLib user can multi-thread how they like. All my uses of SeqLib have their own multithread framework, and I align within each thread to a single shared-memory BWAWrapper with no problem. I've never found AlignSequence to be non-thread safe, and I've run many whole genome jobs with 20+ cores all accessing a single object. The main workhorse of it is threadsafe in BWA itself.

I've seen those valgrind warnings on BamWriter, they are innocuous. It is saying that they are coming from an uninitialized value in libz, so I'm not worried. Valgrind can be very strict. If you think it's causing problems though, I'd be interested to know though what the issues are.

Not sure about the best paired-end strategy, but am guessing that adapting BWA's method would be best. If you wanted to make a new BWAWrapper function to handle pairs of reads, I would welcome the PR!

Hope this helps, Jeremiah

martinjvickers commented 6 years ago

Hi @walaj, I'm also interested in using the SeqLib BWA wrapper for PE reads. You mentioned;

From what I can tell from working through the code, BWA generates and alignment for each pair, and then has a function that takes the individual alignments and does some joint inference to refine.

Could you point me in the direction of where this is in bwa? I've been struggling to work out what functions are related to this.