Closed marcelm closed 2 years ago
So instead of this:
It would look something like this (not sure if this would work):
if (all_nams.size() == 0) {
sam_file.append(SamRecord::unmapped(query_acc, read, qual));
return;
}
Have we excluded using an external library? I know depending on extend this might end up as a major rewrite, but something like e.g. https://www.seqan.de/ would give us all the I/O, data-structures for reads and alignments and a whole bunch of algorithms like banded Smith Waterman or the like in it which might be useful. though it is a pretty hefty thing to include in a project.
I am scared such changes because (i) extra external libraries/compilation and particularly (ii) speed. I am not an expert on C++, but what I did spent great care with for this tool was to optimize speed to the extent that I possibly could. Unless seqan has a state-of-the-art SW library (SIMD optimized etc) it could be detrimental to the runtime.
We currnetly use SSW for SW alignment, but I think WFA could be explored.
I should also mention that I chose kseq++ for reading writing fastq that was shown to be fast.
SeqAn is a great library and I’m positive its performance would be competitive. But of course I cannot be sure. In any case, I agree it would be a major undertaking and probably increase compile times. The SAM record class this issue is about is probably just 200 lines of relatively straightforward code. I estimate that most of the work is actually refactoring the sam_string.append
lines in aln.cpp
so that they use this new class, and that work needs to be done no matter which writer we use.
And to clarify: I think that it would be totally appropriate that StrobeAlign does not gain support for BAM or CRAM. In practice, one wants to produce sorted BAM/CRAM files anyway, so the output should be piped into samtools sort
. Nowadays, samtools sort
understands both SAM and BAM, so in terms user friendliness, it does not make a difference. There may be a performance benefit of writing (uncompressed) BAM, so that could be something to discuss.
The support of writing bam directly is interesting. When an aligner (at least strobealign) is used with many cores, write I/O starts to be noticeable or even dominate (say 16 cores or more).
Supporting uncompressed BAM should also not be too hard once the SAM writing code is factored out (#61), even if we write the code for it ourselves (it’s a relatively simple binary format). Let me open a new issue for this.
... and there may also be ways of speeding up writing SAM. My feeling is that it should not be a bottleneck if optimized a little bit.
To solve #31 and #56, we should create an
AlignmentRecord
or so class (or something like that). The idea is that objects of this class represent the information from one row in a SAM/BAM file, and that there would be a method that writes out the data in SAM format. All the functions that want to write SAM output then would create one instance per row that they want to write and let the class do the formatting. That way, the formatting (serialization) is done in one place only and could, at least in theory, be much more easily switched for something else, for example, if we wanted to support writing BAM or CRAM (not that I suggest this, I think writing SAM is totally fine).Possibly some kind of
SamFile
or so class could be involved in order to store the read group ID that we want to add as part of #31.