ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

Choose random mapping locations for multimappers #364

Closed marcelm closed 7 months ago

marcelm commented 7 months ago

This is an alternative to #360 that works by shuffling the top NAMs that have the same score as the best one.

See #359

samdiff.py outputs this when comparing to main:

       9643 reads were unmapped before and after
          0 reads became mapped
          0 reads became unmapped
      82741 reads were mapped to same locus before and after
*      7421 reads were multimapper before and after, same alignment score (AS)
*        17 reads were multimapper before and after, better alignment score (AS)
*         5 reads were multimapper before and after, worse alignment score (AS)
*       173 reads changed in another way
     100000 total reads

The high number of reads that get the same alignment score before and after is a sign that this works as intended. There are some reads that get a different score, but this is as expected and the price to pay for this faster variant that only shuffles the NAMs.

Accuracy changes in an unremarkable way (I ran this comparison only on a subset of the datasets):

dataset main (8c74210) this PR difference
drosophila-50-se 82.5175 82.5108 -0.0067
drosophila-100-se 89.2626 89.255 -0.0076
drosophila-150-se 90.937 90.9448 +0.0078
drosophila-200-se 91.9463 91.9559 +0.0096
drosophila-300-se 93.2153 93.2187 +0.0034
maize-50-se 47.4622 47.4417 -0.0205
maize-100-se 70.5148 70.503 -0.0118
maize-200-se 86.7001 86.7083 +0.0082
maize-300-se 91.8516 91.8522 +0.0006
CHM13-50-se 81.7231 81.7321 +0.0090
CHM13-100-se 90.6468 90.6401 -0.0067
CHM13-150-se 92.4066 92.4081 +0.0015
CHM13-200-se 93.221 93.2167 -0.0043
CHM13-300-se 94.1508 94.1472 -0.0036

Average difference se: -0.0015

We don’t have a way at the moment to test whether this results in the intended outcome, that is, whether locations for multimappers are less biased.

ksahlin commented 7 months ago

Awesome! I will check on the 10 E. coli genomes dataset that I generated before. (also pinging @psj1997 @luispedro FYI)

ksahlin commented 7 months ago

Oh, I realize that this is currently only for SE. I will do the check in SE mode on the 10 E. colis.

marcelm commented 7 months ago

One could do the exact same thing with the list of NAMs in paired-end mode. Do you think this could help a little bit until we have something smarter?

marcelm commented 7 months ago

One could do the exact same thing with the list of NAMs in paired-end mode. Do you think this could help a little bit until we have something smarter?

I added that now. It seems to do something ...:

      17339 reads were unmapped before and after
*         5 reads became mapped
*         2 reads became unmapped
     171879 reads were mapped to same locus before and after
*      9761 reads were multimapper before and after, same alignment score (AS)
*        22 reads were multimapper before and after, better alignment score (AS)
*        11 reads were multimapper before and after, worse alignment score (AS)
*       981 reads changed in another way
     200000 total reads
ksahlin commented 7 months ago

I first compared a30d779 (strobealign-random) with v0.11.0 (denoted strobealign). I accidentally ran in PE mode, the uniformity was still as biased as in v0.11.0 (as expected). However, I noticed a small change in accuracy, stats below. Is this expected?

Screenshot 2023-11-16 at 18 02 34

Then you posted commit 57fe7df so I tried that too (strobealign-random). Here are the accuracy stats:

Screenshot 2023-11-16 at 18 09 18

The bump in accuracy is quite bizarre if you ask me. This is a subset of 10k reads so sample size is small, however it is nearly 1% improvement. I have to test on a larger dataset to see if it evens out.


Now to what I was actually gonna test (the randomness). The name of the game is to get close to 0.945-0.950 in the last column (based on bowtie2 and BWA's distributions). This column measures "fraction of bases on chromosome with depth equal to column 2 (i.e., 0 in our case)". The true numbers for each genome are unknown. However, since reads were simulated at random, I expect then to be the same if there is no mapping bias.

As we can see the random version improves the numbers a lot - although not as even as BWA/Bt2 (but we do obtain higher accuracy than both of them on this small dataset).

v0.11.0

$ bedtools genomecov -ibam /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_10/150.strobealign_default.bam | grep -E "[0-9]+\t0\t"
0   0   4900606 5269732 0.929954
3   0   4878646 5217379 0.935076
5   0   5009761 5270094 0.950602
8   0   4955499 5255484 0.94292
11  0   5051905 5214647 0.968791
15  0   5370618 5703439 0.941646
18  0   5510964 5730399 0.961707
21  0   5139976 5536811 0.928328
26  0   5250803 5668289 0.926347
32  0   5674359 5770266 0.983379

commit 57fe7df

$ bedtools genomecov -ibam /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_10/150.strobealign_random.bam | grep -E "[0-9]+\t0\t"
0   0   4968465 5269732 0.942831
3   0   4928918 5217379 0.944712
5   0   4989804 5270094 0.946815
8   0   4958396 5255484 0.943471
11  0   4946985 5214647 0.948671
15  0   5334311 5703439 0.93528
18  0   5464417 5730399 0.953584
21  0   5227056 5536811 0.944055
26  0   5396273 5668289 0.952011
32  0   5511444 5770266 0.955146

BWA-MEM

$ bedtools genomecov -ibam /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_10/150.bwa_mem.bam | grep -E "[0-9]+\t0\t"
0   0   4979591 5269732 0.944942
3   0   4928776 5217379 0.944684
5   0   4981872 5270094 0.94531
8   0   4988722 5255484 0.949241
11  0   4930204 5214647 0.945453
15  0   5411526 5703439 0.948818
18  0   5431322 5730399 0.947809
21  0   5234316 5536811 0.945367
26  0   5364423 5668289 0.946392
32  0   5466397 5770266 0.947339

Bowtie2

$ bedtools genomecov -ibam /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_10/150.bowtie2.bam | grep -E "[0-9]+\t0\t"
0   0   4981502 5269732 0.945305
3   0   4929527 5217379 0.944828
5   0   4988778 5270094 0.94662
8   0   4978964 5255484 0.947384
11  0   4927270 5214647 0.94489
15  0   5398555 5703439 0.946544
18  0   5431461 5730399 0.947833
21  0   5251857 5536811 0.948535
26  0   5360894 5668289 0.945769
32  0   5469114 5770266 0.94781

(I also checked the distribution in SE mode and it evend out the distribution substantially (I don't have the BWA/Bt2 comparison stats) but I beleive it works as expected.

ksahlin commented 7 months ago

I couldn't believe the accuracy increase, and rightfully so. I simulated three replicate instances with 100k PE reads in each instance, below are the PE alignment results. There seems to be no noticeable effect on accuracy, which we expect, I guess.

I will to the same analysis for 100 genomes (I want to check the effect as more repeated NAMs are introduced)

Screenshot 2023-11-16 at 18 50 20
ksahlin commented 7 months ago

The read simulator complained about the 100 ecoli dataset, but here are the results for 50 ecoli genomes.

Screenshot 2023-11-16 at 18 59 37

I also checked the read placement randomness (coverage uniformity across genomes). Looks much better - centered around 0.88-0.89 (below I dump the data for that for documentation - sry).

I think we can merge this. What do you say? We still have the potential improvement in PE NAM pairing that can be added before the random shuffle, but for now this version should be an improvement over baseline.

(devel) math172l2:OBJ_C ksahlin$ bedtools genomecov -ibam /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_50/150.strobealign_default.bam | grep -E "[0-9]+\t0\t"
0   0   4425709 5269732 0.839836
3   0   4785792 5217379 0.917279
5   0   4729574 5270094 0.897436
8   0   4642790 5255484 0.883418
11  0   4801312 5214647 0.920736
15  0   4957996 5703439 0.869299
18  0   5266074 5730399 0.918972
21  0   5163049 5536811 0.932495
26  0   5181816 5668289 0.914176
32  0   5296838 5770266 0.917954
37  0   5288767 5767030 0.917069
41  0   5232706 5747479 0.910435
44  0   5150896 5770126 0.892683
47  0   5540245 5833706 0.949696
53  0   5184368 5722815 0.905912
58  0   4758577 5366488 0.886721
62  0   4831584 5226109 0.924509
65  0   4628852 5029720 0.9203
72  0   4629632 5061873 0.914608
75  0   4498989 5110834 0.880285
80  0   4732493 5132219 0.922114
85  0   4712882 5255408 0.896768
87  0   4542237 5111512 0.888629
89  0   4415368 5062632 0.872149
92  0   4291407 5079315 0.844879
95  0   4636143 5197709 0.891959
99  0   4407837 5041928 0.874236
102 0   4614709 5138060 0.898142
108 0   4492315 5138043 0.874324
114 0   4749347 5147947 0.922571
122 0   4628475 5029969 0.92018
126 0   4582979 5103692 0.897973
132 0   4076951 5098211 0.799683
133 0   4464208 5098635 0.875569
135 0   4519142 5255987 0.859808
137 0   4671210 5256017 0.888736
146 0   4708633 5099979 0.923265
147 0   4636300 5085570 0.911658
148 0   4624204 5099806 0.906741
149 0   4629665 5088968 0.909745
151 0   4668302 5089992 0.917153
153 0   4516693 5088968 0.887546
155 0   4759101 5285418 0.900421
159 0   4673995 5072410 0.921454
161 0   4717066 5060072 0.932213
163 0   4794753 5090382 0.941924
165 0   4679791 5088970 0.919595
167 0   4532985 5285512 0.857625
169 0   4851099 5285097 0.917883
171 0   4387097 5284087 0.830247
(devel) math172l2:OBJ_C ksahlin$ samtools view -b /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_50/150.strobealign_random.sam | samtools sort -o /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_50/150.strobealign_random.bam
(devel) math172l2:OBJ_C ksahlin$ bedtools genomecov -ibam /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_50/150.strobealign_random.bam | grep -E "[0-9]+\t0\t"
0   0   4683567 5269732 0.888768
3   0   4702112 5217379 0.90124
5   0   4709841 5270094 0.893692
8   0   4632677 5255484 0.881494
11  0   4728239 5214647 0.906723
15  0   4962045 5703439 0.870009
18  0   5287916 5730399 0.922783
21  0   5061211 5536811 0.914102
26  0   5200693 5668289 0.917507
32  0   5280821 5770266 0.915178
37  0   5306395 5767030 0.920126
41  0   5262653 5747479 0.915645
44  0   5176862 5770126 0.897184
47  0   5383307 5833706 0.922794
53  0   5259823 5722815 0.919097
58  0   4761994 5366488 0.887358
62  0   4776485 5226109 0.913966
65  0   4493271 5029720 0.893344
72  0   4590887 5061873 0.906954
75  0   4497195 5110834 0.879934
80  0   4639345 5132219 0.903965
85  0   4705894 5255408 0.895438
87  0   4568653 5111512 0.893797
89  0   4399648 5062632 0.869044
92  0   4248203 5079315 0.836373
95  0   4609376 5197709 0.886809
99  0   4386183 5041928 0.869942
102 0   4571898 5138060 0.88981
108 0   4535183 5138043 0.882667
114 0   4673614 5147947 0.90786
122 0   4565770 5029969 0.907713
126 0   4563826 5103692 0.89422
132 0   4550035 5098211 0.892477
133 0   4472361 5098635 0.877168
135 0   4533234 5255987 0.86249
137 0   4624282 5256017 0.879807
146 0   4680840 5099979 0.917816
147 0   4626446 5085570 0.90972
148 0   4601493 5099806 0.902288
149 0   4630542 5088968 0.909918
151 0   4658783 5089992 0.915283
153 0   4560742 5088968 0.896202
155 0   4674017 5285418 0.884323
159 0   4635390 5072410 0.913844
161 0   4657656 5060072 0.920472
163 0   4684726 5090382 0.920309
165 0   4675791 5088970 0.918809
167 0   4610326 5285512 0.872257
169 0   4702754 5285097 0.889814
171 0   4606436 5284087 0.871756
ksahlin commented 7 months ago

Finally, I also checked that in mapping mode -x, the accuracy stayed the same (statistically) the randomness improved.

First, in the 50 genomes dataset the true highest and lowest number of reads simulated from the genomes are and 4,530 and 3,680, respectively.

Counting the number of reads mapping to each of the 50 genomes, I observed 8699 and 1974 reads mapped to highest and lowest genome for v0.11.0, respectively.

For this PR I observed 6913 and 2788, respectively.

So a big improvement but still quite far of the true values. However, we need to be careful not to get too disappointed here because statistical variation comes into play (too tired to derive how large it is, but the upper bound - all reads completely random - probably follows some common distribution).

marcelm commented 7 months ago

However, I noticed a small change in accuracy, stats below. Is this expected?

Yes, that change is from #336.

Here is the change in accuracy just from this PR.

Comparing accuracy

dataset 8c74210 57fe7df difference
drosophila-50-se 82.5175 82.5108 -0.0067
drosophila-75-se 87.9797 88.033 +0.0533
drosophila-100-se 89.2626 89.255 -0.0076
drosophila-150-se 90.937 90.9448 +0.0078
drosophila-200-se 91.9463 91.9559 +0.0096
drosophila-300-se 93.2153 93.2187 +0.0034
drosophila-500-se 94.5722 94.5586 -0.0136
maize-50-se 47.4622 47.4417 -0.0205
maize-75-se 61.8938 61.9045 +0.0107
maize-100-se 70.5148 70.503 -0.0118
maize-150-se 81.1606 81.154 -0.0066
maize-200-se 86.7001 86.7083 +0.0082
maize-300-se 91.8516 91.8522 +0.0006
maize-500-se 95.3796 95.4038 +0.0242
CHM13-50-se 81.7231 81.7321 +0.0090
CHM13-75-se 88.9376 88.9318 -0.0058
CHM13-100-se 90.6468 90.6401 -0.0067
CHM13-150-se 92.4066 92.4081 +0.0015
CHM13-200-se 93.221 93.2167 -0.0043
CHM13-300-se 94.1508 94.1472 -0.0036
CHM13-500-se 95.0531 95.0499 -0.0032
rye-50-se 44.7328 44.7464 +0.0136
rye-75-se 60.236 60.2692 +0.0332
rye-100-se 69.3509 69.345 -0.0059
rye-150-se 80.4219 80.412 -0.0099
rye-200-se 85.9117 85.9279 +0.0162
rye-300-se 90.5181 90.5152 -0.0029
rye-500-se 93.5172 93.5133 -0.0039
drosophila-50-pe 90.1814 90.17505 -0.0063
drosophila-75-pe 91.6055 91.62985 +0.0243
drosophila-100-pe 92.38555 92.37945 -0.0061
drosophila-150-pe 93.2124 93.1991 -0.0133
drosophila-200-pe 93.5107 93.5143 +0.0036
drosophila-300-pe 95.36815 95.37105 +0.0029
drosophila-500-pe 95.70805 95.673 -0.0350
maize-50-pe 71.48315 71.48905 +0.0059
maize-75-pe 82.1135 82.1222 +0.0087
maize-100-pe 87.1423 87.1474 +0.0051
maize-150-pe 91.68815 91.6918 +0.0037
maize-200-pe 92.9413 92.93575 -0.0055
maize-300-pe 96.71515 96.71725 +0.0021
maize-500-pe 97.2744 97.29125 +0.0168
CHM13-50-pe 90.648 90.64065 -0.0073
CHM13-75-pe 92.51265 92.51435 +0.0017
CHM13-100-pe 93.2113 93.22665 +0.0154
CHM13-150-pe 94.1508 94.1352 -0.0156
CHM13-200-pe 94.4201 94.4225 +0.0024
CHM13-300-pe 95.62845 95.63475 +0.0063
CHM13-500-pe 95.9551 95.97645 +0.0213
rye-50-pe 69.146 69.185 +0.0390
rye-75-pe 80.5645 80.59005 +0.0255
rye-100-pe 85.6234 85.59965 -0.0237
rye-150-pe 90.23185 90.22415 -0.0077
rye-200-pe 91.4807 91.47285 -0.0079
rye-300-pe 94.55785 94.56635 +0.0085
rye-500-pe 95.15445 95.16665 +0.0122

Average difference se: +0.0028 Average difference pe: +0.0027

marcelm commented 7 months ago

As I suggested by e-mail, I combined this PR with #360.

PR #360 was problematic because it would not allow the optimization that one can stop doing alignments if an exact match has been found. This is taken care of by NAM shuffling: Assuming that two exact matches also give the exact NAM score, we pick one of these exact matches randomly.

However, if the first match is not an exact one, we do compute alignments until the dropoff is reached or the maximum number of sites has been tried. So we do that already and can, without additional cost, do the exact version from #360.

This changes the samdiff output thus:

       9643 reads were unmapped before and after
          0 reads became mapped
          0 reads became unmapped
      80441 reads were mapped to same locus before and after
*      9716 reads were multimapper before and after, same alignment score (AS)
*        14 reads were multimapper before and after, better alignment score (AS)
*         4 reads were multimapper before and after, worse alignment score (AS)
*       182 reads changed in another way
     100000 total reads

So instead of 7421 (see output pasted somewhere above), now 9716 multimappers are affected.

I also reproduced what you did on the 50 E. coli reference. I used different genomes and simulated 1 million reads, so the numbers are different (should be somewhat reproducible now, will document later). Here is the last column of bedtools genomecov (for those rows where the second column is 0) plotted for the various program versions ("shuffle-more" is the mentioned change):

genomes

ksahlin commented 7 months ago

This is great!

Did you see any noticeable slowdown?

Would this approach work for PE?

marcelm commented 7 months ago

Did you see any noticeable slowdown?

It did not become measurably slower.

Would this approach work for PE?

I think something similar could work; I will look into it.

marcelm commented 7 months ago

So this PR now

I would like to suggest that we merge this PR with the three features above included and that I open a separate PR for adding random read pair selection for paired-end reads (so as not to make the discussion here too long).

ksahlin commented 7 months ago

Awesome! Approved to merge.