ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

Support end-to-end alignments? #399

Open y9c opened 4 months ago

y9c commented 4 months ago

It seems that "penalty" for soft clip, which is specified by the -L argument is "reward" rather than "penalty". More specicailly, given the example below:

reference file is

>ref
ACTGGCATTAGTGGGACTTTTTTTTTTTTTTTTTTTAATGTTAAAAGTCCCACTAATGCCAGC

input fastq file is

@test
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTAATAGTCCATTCTGTTTCTTCAGAAAGGTAACACAAACACACTTAAACATTAA
+
I99III9IIIIIIIIIIIIIIIIIIIII99-----99--9--99I----99-9I-99-I9---II9---99--999----99

Command strobealign -t 16 -L 0 -k 10 ref.fa input.fq output mapping result as follow:

test    0       ref     18      0       10S22M50S (Score: AS:i:44)
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTAATAGTCCATTCTGTTTCTTCAGAAAGGTAACACAAACACACTTAAACATTAA
          ||||||||||||||||||||||
NNNNNNNNNNTTTTTTTTTTTTTTTTTTTAATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

While command strobealign -t 16 -L 100 -k 10 ref.fa input.fq output mapping result as follow:

test    0       ref     8       0       32M50S (Score: AS:i:94)
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTAATAGTCCATTCTGTTTCTTCAGAAAGGTAACACAAACACACTTAAACATTAA
||  |     ||||||||||||||||||||||
TTAGTGGGACTTTTTTTTTTTTTTTTTTTAATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

However, since the polyT region in the reads, which is complement to polyA should not be aligned into the internal region of the reference sequence. It means that end-to-end mapping is required for this example, and penalty for soft clipping is necessary. But the -L argument increases the align score rather than decrease it. Is is possible to do real "soft clip pentalty" for strobealign?

marcelm commented 4 months ago

The -L is indeed a reward and I have still on my to-do list to document this better. It’s called an end bonus. It should work like an end penalty, the only difference is that the logic is inverted: instead of reducing the score when there is soft clipping, we add to the score when there is no soft clipping.

In your example, the query is longer than the reference, so there’s necessarily soft clipping. How do you want the alignment to look?

y9c commented 4 months ago

Thank you for the reply. I want to drop these reads in the alignment output, since they are mapping errors and would significantly overestimate the mapping rate. I can filter the AS tag manually, but it seems that -L argument is not working for clipping penalty. If the value is less than zero, the strobealign ignores the setting.

marcelm commented 4 months ago

Would you filter out negative AS values? I think you can do the equivalent here: Use -L 10000 and then filter alignments with AS < 10000. Regular alignments without any soft clipping should get AS around 20000 (because there are two end bonuses) and alignments with one end soft clipped get AS around 10000.

y9c commented 4 months ago

will there be some side effects? If AS is the large, how can we ensure the resolution of -A and -B arguments.

y9c commented 4 months ago

Is it possible to do end to end mapping using strobealign

https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#end-to-end-alignment-versus-local-alignment

marcelm commented 4 months ago

will there be some side effects? If AS is the large, how can we ensure the resolution of -A and -B arguments.

It should be fine. Since scores are only ever added up, the only consequence is that AS values are shifted and that soft-clipping is avoided if possible.

One limitation of soft clipping in strobealign is that the alignment library we use (StripedSmithWaterman) does not directly support an end bonus/soft clipping penalty, so we compute the alignment without end bonus/soft clipping penalty and then attempt to extend the alignment to the ends of the query to see whether that gives a better overall score when taking the end bonus into account.

Is it possible to do end to end mapping using strobealign

https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#end-to-end-alignment-versus-local-alignment

Probably not as long as we use StripedSmithWaterman. We have also considered switching to ksw2, which would suppor this better.

That said, my impression (which may be wrong) is that forcing end-to-end alignment is not the right thing to do here. It seems you’re attempting to use it as a kind of filter that gives you a very low AS when there would otherwise be soft clipping. Can’t you just directly filter by AS? Or directly filter out anything with soft clipping?

y9c commented 4 months ago

Thank you very much for the explanation. I was trying to filter out these reads using the AS tag, but it seems that soft clipping does not add penalty to this tag. But I will try to filter base one the alignment length over the query read length.

marcelm commented 3 months ago

@fplaza, You have shown some interest in end-to-end alignment as well. Can you say why you would want it (what is your use case)?

We have some code that we never added where we switched to using WFA2 for the alignment (see #229 and #251). WFA2 supports end-to-end alignment, which we didn’t want at the time, but if there’s a compelling reason for allowing end-to-end alignments, we could add a command-line flag that switches to WFA2. This needs to be considered carefully, however, because it would add quite some code for a relatively small feature that may not be used that often.

fplazaonate commented 3 months ago

Hi @marcelm ,

We have a legacy pipeline that maps reads (shotgun metagenomic data) against a gene catalogue. We use bowtie2 in end-to-end mode by trimming reads around 80-100bp to improve mapping at genes edges. We could perform local alignments by making sure that soft-trimming occurs at the edges of genes. Yet, we have never been really satisfied with this solution as it increases FP rates.

I also use end-to-end alignments when I map reads from a sample against its own assembly because I expect that reads will be fully aligned.

apcamargo commented 3 weeks ago

I also have a use case for end-to-end alignments. We are currently experimenting with different read mappers to align spacer sequences of CRISPR arrays to phage genomes to infer the bacterial hosts of phages (if we have a good match between the spacer and the phage genome, it is likely that that phage infected the bacteria that encoded the CRISPR array). The number of mismatches across the entire length of the spacer is used as a proxy for the host assignment reliability, so the end-to-end alignment is required.

Right now we are using BLAST to perform those alignments (as described in the "Host assignment" section here), but it is not really scaling to the amount of data we are processing. For now we are relying on Bowtie 1 because it allows us to have end-to-end alignments and to control the maximum number of mismatches, but moving to an actively maintained tool would be great.

Tagging @simroux, who is also interested in this.