ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
145 stars 17 forks source link

Create classes that represent SAM files and records #58

Closed marcelm closed 2 years ago

marcelm commented 2 years ago

To solve #31 and #56, we should create an AlignmentRecord or so class (or something like that). The idea is that objects of this class represent the information from one row in a SAM/BAM file, and that there would be a method that writes out the data in SAM format. All the functions that want to write SAM output then would create one instance per row that they want to write and let the class do the formatting. That way, the formatting (serialization) is done in one place only and could, at least in theory, be much more easily switched for something else, for example, if we wanted to support writing BAM or CRAM (not that I suggest this, I think writing SAM is totally fine).

Possibly some kind of SamFile or so class could be involved in order to store the read group ID that we want to add as part of #31.

marcelm commented 2 years ago

So instead of this:

https://github.com/ksahlin/StrobeAlign/blob/e7fc8f5764abd753092d7e781fc06ff9f7128e68/src/aln.cpp#L1313-L1320

It would look something like this (not sure if this would work):

if (all_nams.size() == 0) {
    sam_file.append(SamRecord::unmapped(query_acc, read, qual));
    return;
}
PaulPyl commented 2 years ago

Have we excluded using an external library? I know depending on extend this might end up as a major rewrite, but something like e.g. https://www.seqan.de/ would give us all the I/O, data-structures for reads and alignments and a whole bunch of algorithms like banded Smith Waterman or the like in it which might be useful. though it is a pretty hefty thing to include in a project.

ksahlin commented 2 years ago

I am scared such changes because (i) extra external libraries/compilation and particularly (ii) speed. I am not an expert on C++, but what I did spent great care with for this tool was to optimize speed to the extent that I possibly could. Unless seqan has a state-of-the-art SW library (SIMD optimized etc) it could be detrimental to the runtime.

We currnetly use SSW for SW alignment, but I think WFA could be explored.

ksahlin commented 2 years ago

I should also mention that I chose kseq++ for reading writing fastq that was shown to be fast.

marcelm commented 2 years ago

SeqAn is a great library and I’m positive its performance would be competitive. But of course I cannot be sure. In any case, I agree it would be a major undertaking and probably increase compile times. The SAM record class this issue is about is probably just 200 lines of relatively straightforward code. I estimate that most of the work is actually refactoring the sam_string.append lines in aln.cpp so that they use this new class, and that work needs to be done no matter which writer we use.

And to clarify: I think that it would be totally appropriate that StrobeAlign does not gain support for BAM or CRAM. In practice, one wants to produce sorted BAM/CRAM files anyway, so the output should be piped into samtools sort. Nowadays, samtools sort understands both SAM and BAM, so in terms user friendliness, it does not make a difference. There may be a performance benefit of writing (uncompressed) BAM, so that could be something to discuss.

ksahlin commented 2 years ago

The support of writing bam directly is interesting. When an aligner (at least strobealign) is used with many cores, write I/O starts to be noticeable or even dominate (say 16 cores or more).

marcelm commented 2 years ago

Supporting uncompressed BAM should also not be too hard once the SAM writing code is factored out (#61), even if we write the code for it ourselves (it’s a relatively simple binary format). Let me open a new issue for this.

marcelm commented 2 years ago

... and there may also be ways of speeding up writing SAM. My feeling is that it should not be a bottleneck if optimized a little bit.