Refference segments' coordinates probelem

AlAuAu commented 1 year ago

Hi,when I read the paper as metioned in the README,I'm a little confused about the coordinate of the reference segment (M.rs − M.qe, M.re + (|q| − M.qe)). How do I understand it in Biology ？I also tried to understand this by reading the code,but in the code, the projected_ref_start represents as M.rs-M.qs. They look similary but different. In addition, I'm also confused about why we choose local alignment method ssw instead of global alignment method.Did it performance better in the experiment? Or some biological reason I don‘t know. Thank you for considering my issues.

ksahlin commented 1 year ago

Hi @AlAuAu ,

First the main idea: the aim is to extract a large enough surrounding region in the reference to which we can align the read, allowing for indels in the read (particularly deletions in the read is the reason behind it).

The anchor between the query and reference may only cover part of the query. This is why we need to take 'some extra sequence' at each end of the reference. There is a tradeoff on how large a piece of the reference we take. We cannot take the whole reference because it would be extremely slow to align the read to. So we take a little extra from the beginning and end: the − M.qe is the extra start, and the (|q| − M.qe) is the extra end. You might have expected − M.qs instead of − M.qe, but this would lead to no 'extra sequence' in the reference to be aligned to in case there was a deletion in the read, and it would not align though the deletion. There are no right or wrong values here. The more at each end we take, the larger indels we can align through, but the slower the speed.

As answered above, because we take extra sequences in ends, we have to do either local or semi-global alignment to the reference (we don't want to penalize missing reference ends). Local alignment is beneficial, not forcing the read to be fully aligned if it doesn't fit due to, e.g., an insertion.

Also, from v0.8.0: I am not sure these extended value are correct, since we now partition the alignment differently. Looking quickly in https://github.com/ksahlin/strobealign/blob/main/src/aln.cpp#L227 we infer exact projected reference coordinates (https://github.com/ksahlin/strobealign/blob/main/src/aln.cpp#L255) and then simply take 50nt extra in each end. @marcelm can fill in if I am wrong.

marcelm commented 1 year ago

and then simply take 50nt extra in each end.

Yes, that’s how it’s done. So essentially: Assume ungapped alignment and extend the hit so that it covers the full query from beginning to end, then take 50 bp extra on each end on the reference and align the full query against it.

I'm also confused about why we choose local alignment method ssw instead of global alignment method.

Kristoffer said this already, but maybe as an additional point: local alignment allows for soft clipping. This allows to skip extra, non-template sequence in the query such as untrimmed adapters.

And just as a nit pick: Note that global alignment would be to align two sequences from end to end (the entire reference against the query). The alternative to local alignment we mean is semi-global or "ends-free" alignment.

AlAuAu commented 1 year ago

Hi Kristoffer, marcelm, Thank you for your patient answer,which is very helpful for me.

marcelm commented 1 year ago

Great that we were able to help!

Do you think the issue can be closed now?

AlAuAu commented 1 year ago

Of course,hope you have a nice day!

marcelm commented 1 year ago

Great, please get back to us if you have more questions!

ksahlin / strobealign

Refference segments' coordinates probelem #283