ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License

Does --interleaved need to support mixed input? #274

Open marcelm opened 1 year ago

marcelm commented 1 year ago

Hi @luispedro. I was just looking into #273, which is about the --interleaved option. It currently allows mixing single-end reads with paired-end reads. One issue with the current implementation is that the reads are reordered in the output (within each chunk, pairs come first, then singles), which is unexpected from a user’s point of view. It is also surprising that --interleaved allows this mixing at all. While I was planning how to fix the reordering and how to better explain how --interleaved works, I started wondering whether this type of mixed input is actually something that should be supported.
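To make the reordering concrete, here is a toy Python model of the behavior described above. It is not strobealign's implementation: the record representation, the chunk size, and the function name `chunked_pairs_first` are all made up for illustration.

```python
from typing import List, Tuple

# A record is (kind, name), where kind is "pair" or "single".
# This representation is hypothetical and only serves the illustration.
Record = Tuple[str, str]


def chunked_pairs_first(records: List[Record], chunk_size: int) -> List[Record]:
    """Emit records chunk by chunk, with pairs before singles inside each chunk."""
    out: List[Record] = []
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        # sorted() is stable, so the relative order among pairs
        # (and among singles) within a chunk is preserved.
        out.extend(sorted(chunk, key=lambda r: r[0] != "pair"))
    return out


records = [("single", "s1"), ("pair", "p1"), ("single", "s2"), ("pair", "p2")]
reordered = chunked_pairs_first(records, chunk_size=2)
```

In this model the output order depends on where the chunk boundaries fall, so the same input can come out in a different order with a different chunk size. That is the part a user would not expect.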

So my question: Are you actually using this or do you know anyone relying on this behavior? Because the easiest fix would be to have --interleaved mean only what it says and just disallow mixed inputs.

cf #213

luispedro commented 1 year ago

For my intended use of strobealign as a replacement for bwa mem in NGLess, we do need the mixed format.

It is very common for real datasets to be in this format (often because they start out as paired-end and, due to QC, some sequences lose their mate). Technically, the main advantage is that we can then stream the reads through a pipe; otherwise, we would need to write the files to disk.

I would strongly prefer to keep compatibility with bwa mem -p as much as possible. I can check how it handles the /1-/2 suffixes.
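As a rough illustration of what order-preserving handling of mixed interleaved input could look like, here is a minimal Python sketch that pairs adjacent records when their names match after stripping a /1 or /2 suffix. This is an assumption about one plausible rule, not a description of what bwa mem -p or strobealign actually do; the helper names `base_name` and `group_mixed` are hypothetical, and the suffix handling is exactly the part that still needs checking.

```python
from typing import Iterable, Iterator, Optional, Tuple


def base_name(name: str) -> str:
    """Strip a trailing /1 or /2 mate suffix, if present (assumed convention)."""
    if name.endswith("/1") or name.endswith("/2"):
        return name[:-2]
    return name


def group_mixed(names: Iterable[str]) -> Iterator[Tuple[str, ...]]:
    """Yield (a, b) for adjacent mates and (a,) for singles, in input order."""
    pending: Optional[str] = None
    for name in names:
        if pending is None:
            pending = name
        elif base_name(pending) == base_name(name):
            # Two adjacent records with the same base name form a pair.
            yield (pending, name)
            pending = None
        else:
            # The held record has no mate next to it, so it is a single.
            yield (pending,)
            pending = name
    if pending is not None:
        yield (pending,)


groups = list(group_mixed(["r1/1", "r1/2", "r2/1", "r3/1", "r3/2", "r4"]))
```

Because each group is emitted as soon as it is decided, pairs and singles come out in the same order as they appear in the input, and the input can be consumed from a pipe without buffering more than one record.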

marcelm commented 1 year ago

> I would strongly prefer to keep compatibility with bwa mem -p as much as possible.

Absolutely. If this is something that you rely on in practice, then of course this behavior should be kept. We do need to document it a bit better, though.