lh3 / bwa

Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
GNU General Public License v3.0
1.55k stars 556 forks source link

Patch proposals for interactively working with BWA #199

Open nh13 opened 6 years ago

nh13 commented 6 years ago

Hey @lh3,

please feel free to point me to a different place if this discussion is more appropriate there. I'd like to your thoughts on whether you'd accept the following two patches before I go off and implement them. For motivation, we are finding it more and more useful to be able to interactively use BWA to query for alignments, and would like better ways to support that programmatically elsewhere and in different languages. We do not want to proceed with this work without some agreement for you that you would accept these patches, after your code review. Any feedback would be appreciated on either proposal.

Proposal 1: Support an "interactive" mode in bwa mem

If bwa would interpret consecutive newlines in the stdin as "end-of-batch", then it could process all reads read in so far, and write the results to stdout. This would allow us to spin up bwa in the background, then pipe one or many reads into bwa and read them back in, and then submit a new batch of reads. The key here is that we don't want to have bwa align all reads before getting back the results, just the particular ones at the moment. We did this a while ago for bwa aln for reference: https://github.com/fulcrumgenomics/bwa/tree/interactive_aln.

Proposal 2: API method for bwa mem to return a sam or sam-like struct

We often find that we wish we could run bwa mem on one or a small set of reads, and have a SAM-like struct returned for each (ex. htslib's bam1_t). We may be aligning paired end reads, or fragment (single end/unpaired) reads. We want access to all the fields that SAM affords, including optional tags and the header (produced by bwa). I see two ways to implement this: (1).create a light-weight struct in bwa and return that, or (2) depend on htslib and return bam1_t for each read. We can then create a pybwa or bwajdk or scalabwa as we desire that natively calls bwa. Libraries like https://github.com/nanoporetech/bwapy are good start, but are limited in that they don't support paired end reads and do not give all the info found in a sam-like-record.

Current solution

The current way we achieve proposals 1 & 2 is a bit embarrassing, but basically we run bwa on a batch of reads in the shell writing to the process' stdin and reading from its stdout. We have to execute bwa for each batch of reads, therefore forcing us to batch many reads together to not only get performance, and for paired end reads, infer the insert size correctly.