ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

Scoring of read pairs on same vs. different chromosome #321

Closed marcelm closed 10 months ago

marcelm commented 1 year ago

do you think it would make sense / be worth to score read pairs on the same chromosome (but far away) higher than reads on separate chromosomes? Currently I think there is only the binary 'close and proper' or 'not close and proper' classification.

Originally posted by @ksahlin in https://github.com/ksahlin/strobealign/issues/319#issuecomment-1611405859

marcelm commented 1 year ago

Intuitively, I would say yes, at least in the following sense: If there are two alternatives for mapping a read pair, then I would prefer one that places both on the same chromosome.

I guess it depends on our expectations of what can happen in a sample. I assume that a rearrangement that leads to the reads mapping far apart on the same chromosome is likelier than one that leads to them being on a different chromosome.

It’s not quite the same, but as an experiment, I increased the 10*sigma threshold to 100*sigma. Accuracy in the test dataset decreases slightly:

dataset 47cb5c2 100*sigma difference
drosophila-50 90.189 90.15075 -0.0383
drosophila-100 92.38785 92.3633 -0.0246
drosophila-200 93.52085 93.5193 -0.0015
drosophila-300 95.36085 95.3599 -0.0009
maize-50 71.4703 71.4495 -0.0208
maize-100 87.13175 87.11975 -0.0120
maize-200 92.92045 92.9166 -0.0039
maize-300 96.70835 96.70735 -0.0010
CHM13-50 90.635 90.61355 -0.0214
CHM13-100 93.21985 93.1981 -0.0218
CHM13-200 94.43405 94.41805 -0.0160

But then I don’t think our test data has this type of event, so it’s natural that this isn’t helpful.

ksahlin commented 1 year ago

I think we have detected a bug if my reasoning below is correct, nice!

The score was implemented so that the further away from mean the isize is, the lower score.

As seen here https://github.com/ksahlin/strobealign/blob/main/src/aln.cpp#L1185 read pairs that map more than 5*sigma away gets a score lower than disjoint reads. So in the scoring that happens here: https://github.com/ksahlin/strobealign/blob/main/src/aln.cpp#L1169 , we should probably have something like:

S = (double)a1.sw_score + (double)a2.sw_score + std::max( log(normal_pdf(x, mu, sigma)), -20 + epsilon_pair ); (log(normal_pdf(x, mu, sigma)) is always negative)

In the proposed correction above we have capped the penalty to that of disjoint reads, and we have added a small positive value epsilon_pair to prefer pairs mapping 'together' rather than on, e.g., different chromosomes. A second Idea would also be to play around with -20 to allow more than 5 stddevs away.

The reason I claim that our current scoring is a bug is because we will actually penalize proper but far away mates more than completely separated reads (if sw scores identical), so we will then prefer disjoint reads. This is likely bad for detecting larger deletions.

ksahlin commented 1 year ago

To add a bit of data, I get below penalties libraries for with mu=300, sigma=30 (which I think we use in our simulations?).

So, we get that proper pairs with distance higher than about 470 would be penalised more than individual best alignments (e.g. on different chromosomes).

Also, we set x = std::abs(a1.ref_start - a2.ref_start);. Is this the difference of 5' starts in both reads? If so, we penalise close-by pairs harshly.

x: 30 penalty: -44.820135914866825
x: 60 penalty: -36.320135914866825
x: 90 penalty: -28.820135914866828
x: 120 penalty: -22.320135914866828
x: 150 penalty: -16.820135914866828
x: 180 penalty: -12.320135914866828
x: 210 penalty: -8.820135914866828
x: 240 penalty: -6.320135914866828
x: 270 penalty: -4.820135914866828
x: 300 penalty: -4.320135914866828
x: 330 penalty: -4.820135914866828
x: 360 penalty: -6.320135914866828
x: 390 penalty: -8.820135914866828
x: 420 penalty: -12.320135914866828
x: 450 penalty: -16.820135914866828
x: 480 penalty: -22.320135914866828
x: 510 penalty: -28.820135914866828
x: 540 penalty: -36.320135914866825
x: 570 penalty: -44.820135914866825
x: 600 penalty: -54.320135914866825
x: 630 penalty: -64.82013591486682
x: 660 penalty: -76.32013591486682
x: 690 penalty: -88.82013591486682
x: 720 penalty: -102.32013591486682
x: 750 penalty: -116.82013591486682
x: 780 penalty: -132.32013591486682
x: 810 penalty: -148.82013591486682
x: 840 penalty: -166.32013591486682
x: 870 penalty: -184.82013591486682
x: 900 penalty: -204.32013591486682
x: 930 penalty: -224.82013591486682
x: 960 penalty: -246.32013591486682
x: 990 penalty: -268.82013591486685