please feel free to point me to a different place if this discussion is more appropriate there. I'd like to your thoughts on whether you'd accept the following two patches before I go off and implement them. For motivation, we are finding it more and more useful to be able to interactively use BWA to query for alignments, and would like better ways to support that programmatically elsewhere and in different languages. We do not want to proceed with this work without some agreement for you that you would accept these patches, after your code review. Any feedback would be appreciated on either proposal.
Proposal 1: Support an "interactive" mode in bwa mem
If bwa would interpret consecutive newlines in the stdin as "end-of-batch", then it could process all reads read in so far, and write the results to stdout. This would allow us to spin up bwa in the background, then pipe one or many reads into bwa and read them back in, and then submit a new batch of reads. The key here is that we don't want to have bwa align all reads before getting back the results, just the particular ones at the moment. We did this a while ago for bwa aln for reference: https://github.com/fulcrumgenomics/bwa/tree/interactive_aln.
Proposal 2: API method for bwa mem to return a sam or sam-like struct
We often find that we wish we could run bwa mem on one or a small set of reads, and have a SAM-like struct returned for each (ex. htslib's bam1_t). We may be aligning paired end reads, or fragment (single end/unpaired) reads. We want access to all the fields that SAM affords, including optional tags and the header (produced by bwa). I see two ways to implement this: (1).create a light-weight struct in bwa and return that, or (2) depend on htslib and return bam1_t for each read. We can then create a pybwa or bwajdk or scalabwa as we desire that natively calls bwa. Libraries like https://github.com/nanoporetech/bwapy are good start, but are limited in that they don't support paired end reads and do not give all the info found in a sam-like-record.
Current solution
The current way we achieve proposals 1 & 2 is a bit embarrassing, but basically we run bwa on a batch of reads in the shell writing to the process' stdin and reading from its stdout. We have to execute bwa for each batch of reads, therefore forcing us to batch many reads together to not only get performance, and for paired end reads, infer the insert size correctly.
Hey @lh3,
please feel free to point me to a different place if this discussion is more appropriate there. I'd like to your thoughts on whether you'd accept the following two patches before I go off and implement them. For motivation, we are finding it more and more useful to be able to interactively use BWA to query for alignments, and would like better ways to support that programmatically elsewhere and in different languages. We do not want to proceed with this work without some agreement for you that you would accept these patches, after your code review. Any feedback would be appreciated on either proposal.
Proposal 1: Support an "interactive" mode in
bwa mem
If
bwa
would interpret consecutive newlines in thestdin
as "end-of-batch", then it could process all reads read in so far, and write the results tostdout
. This would allow us to spin upbwa
in the background, then pipe one or many reads intobwa
and read them back in, and then submit a new batch of reads. The key here is that we don't want to havebwa
align all reads before getting back the results, just the particular ones at the moment. We did this a while ago forbwa aln
for reference: https://github.com/fulcrumgenomics/bwa/tree/interactive_aln.Proposal 2: API method for
bwa mem
to return a sam or sam-like structWe often find that we wish we could run
bwa mem
on one or a small set of reads, and have a SAM-like struct returned for each (ex. htslib'sbam1_t
). We may be aligning paired end reads, or fragment (single end/unpaired) reads. We want access to all the fields that SAM affords, including optional tags and the header (produced bybwa
). I see two ways to implement this: (1).create a light-weight struct inbwa
and return that, or (2) depend onhtslib
and returnbam1_t
for each read. We can then create apybwa
orbwajdk
orscalabwa
as we desire that natively callsbwa
. Libraries like https://github.com/nanoporetech/bwapy are good start, but are limited in that they don't support paired end reads and do not give all the info found in a sam-like-record.Current solution
The current way we achieve proposals 1 & 2 is a bit embarrassing, but basically we run
bwa
on a batch of reads in theshell
writing to the process'stdin
and reading from itsstdout
. We have to executebwa
for each batch of reads, therefore forcing us to batch many reads together to not only get performance, and for paired end reads, infer the insert size correctly.