Closed marcelm closed 7 months ago
Awesome! I will check on the 10 E. coli genomes dataset that I generated before. (also pinging @psj1997 @luispedro FYI)
Oh, I realize that this is currently only for SE. I will do the check in SE mode on the 10 E. colis.
One could do the exact same thing with the list of NAMs in paired-end mode. Do you think this could help a little bit until we have something smarter?
One could do the exact same thing with the list of NAMs in paired-end mode. Do you think this could help a little bit until we have something smarter?
I added that now. It seems to do something ...:
17339 reads were unmapped before and after
* 5 reads became mapped
* 2 reads became unmapped
171879 reads were mapped to same locus before and after
* 9761 reads were multimapper before and after, same alignment score (AS)
* 22 reads were multimapper before and after, better alignment score (AS)
* 11 reads were multimapper before and after, worse alignment score (AS)
* 981 reads changed in another way
200000 total reads
I first compared a30d779 (strobealign-random) with v0.11.0 (denoted strobealign). I accidentally ran in PE mode, the uniformity was still as biased as in v0.11.0 (as expected). However, I noticed a small change in accuracy, stats below. Is this expected?
Then you posted commit 57fe7df so I tried that too (strobealign-random). Here are the accuracy stats:
The bump in accuracy is quite bizarre if you ask me. This is a subset of 10k reads so sample size is small, however it is nearly 1% improvement. I have to test on a larger dataset to see if it evens out.
Now to what I was actually gonna test (the randomness). The name of the game is to get close to 0.945-0.950 in the last column (based on bowtie2 and BWA's distributions). This column measures "fraction of bases on chromosome with depth equal to column 2 (i.e., 0 in our case)". The true numbers for each genome are unknown. However, since reads were simulated at random, I expect then to be the same if there is no mapping bias.
As we can see the random version improves the numbers a lot - although not as even as BWA/Bt2 (but we do obtain higher accuracy than both of them on this small dataset).
v0.11.0
$ bedtools genomecov -ibam /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_10/150.strobealign_default.bam | grep -E "[0-9]+\t0\t"
0 0 4900606 5269732 0.929954
3 0 4878646 5217379 0.935076
5 0 5009761 5270094 0.950602
8 0 4955499 5255484 0.94292
11 0 5051905 5214647 0.968791
15 0 5370618 5703439 0.941646
18 0 5510964 5730399 0.961707
21 0 5139976 5536811 0.928328
26 0 5250803 5668289 0.926347
32 0 5674359 5770266 0.983379
commit 57fe7df
$ bedtools genomecov -ibam /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_10/150.strobealign_random.bam | grep -E "[0-9]+\t0\t"
0 0 4968465 5269732 0.942831
3 0 4928918 5217379 0.944712
5 0 4989804 5270094 0.946815
8 0 4958396 5255484 0.943471
11 0 4946985 5214647 0.948671
15 0 5334311 5703439 0.93528
18 0 5464417 5730399 0.953584
21 0 5227056 5536811 0.944055
26 0 5396273 5668289 0.952011
32 0 5511444 5770266 0.955146
BWA-MEM
$ bedtools genomecov -ibam /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_10/150.bwa_mem.bam | grep -E "[0-9]+\t0\t"
0 0 4979591 5269732 0.944942
3 0 4928776 5217379 0.944684
5 0 4981872 5270094 0.94531
8 0 4988722 5255484 0.949241
11 0 4930204 5214647 0.945453
15 0 5411526 5703439 0.948818
18 0 5431322 5730399 0.947809
21 0 5234316 5536811 0.945367
26 0 5364423 5668289 0.946392
32 0 5466397 5770266 0.947339
Bowtie2
$ bedtools genomecov -ibam /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_10/150.bowtie2.bam | grep -E "[0-9]+\t0\t"
0 0 4981502 5269732 0.945305
3 0 4929527 5217379 0.944828
5 0 4988778 5270094 0.94662
8 0 4978964 5255484 0.947384
11 0 4927270 5214647 0.94489
15 0 5398555 5703439 0.946544
18 0 5431461 5730399 0.947833
21 0 5251857 5536811 0.948535
26 0 5360894 5668289 0.945769
32 0 5469114 5770266 0.94781
(I also checked the distribution in SE mode and it evend out the distribution substantially (I don't have the BWA/Bt2 comparison stats) but I beleive it works as expected.
I couldn't believe the accuracy increase, and rightfully so. I simulated three replicate instances with 100k PE reads in each instance, below are the PE alignment results. There seems to be no noticeable effect on accuracy, which we expect, I guess.
I will to the same analysis for 100 genomes (I want to check the effect as more repeated NAMs are introduced)
The read simulator complained about the 100 ecoli dataset, but here are the results for 50 ecoli genomes.
I also checked the read placement randomness (coverage uniformity across genomes). Looks much better - centered around 0.88-0.89 (below I dump the data for that for documentation - sry).
I think we can merge this. What do you say? We still have the potential improvement in PE NAM pairing that can be added before the random shuffle, but for now this version should be an improvement over baseline.
(devel) math172l2:OBJ_C ksahlin$ bedtools genomecov -ibam /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_50/150.strobealign_default.bam | grep -E "[0-9]+\t0\t"
0 0 4425709 5269732 0.839836
3 0 4785792 5217379 0.917279
5 0 4729574 5270094 0.897436
8 0 4642790 5255484 0.883418
11 0 4801312 5214647 0.920736
15 0 4957996 5703439 0.869299
18 0 5266074 5730399 0.918972
21 0 5163049 5536811 0.932495
26 0 5181816 5668289 0.914176
32 0 5296838 5770266 0.917954
37 0 5288767 5767030 0.917069
41 0 5232706 5747479 0.910435
44 0 5150896 5770126 0.892683
47 0 5540245 5833706 0.949696
53 0 5184368 5722815 0.905912
58 0 4758577 5366488 0.886721
62 0 4831584 5226109 0.924509
65 0 4628852 5029720 0.9203
72 0 4629632 5061873 0.914608
75 0 4498989 5110834 0.880285
80 0 4732493 5132219 0.922114
85 0 4712882 5255408 0.896768
87 0 4542237 5111512 0.888629
89 0 4415368 5062632 0.872149
92 0 4291407 5079315 0.844879
95 0 4636143 5197709 0.891959
99 0 4407837 5041928 0.874236
102 0 4614709 5138060 0.898142
108 0 4492315 5138043 0.874324
114 0 4749347 5147947 0.922571
122 0 4628475 5029969 0.92018
126 0 4582979 5103692 0.897973
132 0 4076951 5098211 0.799683
133 0 4464208 5098635 0.875569
135 0 4519142 5255987 0.859808
137 0 4671210 5256017 0.888736
146 0 4708633 5099979 0.923265
147 0 4636300 5085570 0.911658
148 0 4624204 5099806 0.906741
149 0 4629665 5088968 0.909745
151 0 4668302 5089992 0.917153
153 0 4516693 5088968 0.887546
155 0 4759101 5285418 0.900421
159 0 4673995 5072410 0.921454
161 0 4717066 5060072 0.932213
163 0 4794753 5090382 0.941924
165 0 4679791 5088970 0.919595
167 0 4532985 5285512 0.857625
169 0 4851099 5285097 0.917883
171 0 4387097 5284087 0.830247
(devel) math172l2:OBJ_C ksahlin$ samtools view -b /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_50/150.strobealign_random.sam | samtools sort -o /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_50/150.strobealign_random.bam
(devel) math172l2:OBJ_C ksahlin$ bedtools genomecov -ibam /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_50/150.strobealign_random.bam | grep -E "[0-9]+\t0\t"
0 0 4683567 5269732 0.888768
3 0 4702112 5217379 0.90124
5 0 4709841 5270094 0.893692
8 0 4632677 5255484 0.881494
11 0 4728239 5214647 0.906723
15 0 4962045 5703439 0.870009
18 0 5287916 5730399 0.922783
21 0 5061211 5536811 0.914102
26 0 5200693 5668289 0.917507
32 0 5280821 5770266 0.915178
37 0 5306395 5767030 0.920126
41 0 5262653 5747479 0.915645
44 0 5176862 5770126 0.897184
47 0 5383307 5833706 0.922794
53 0 5259823 5722815 0.919097
58 0 4761994 5366488 0.887358
62 0 4776485 5226109 0.913966
65 0 4493271 5029720 0.893344
72 0 4590887 5061873 0.906954
75 0 4497195 5110834 0.879934
80 0 4639345 5132219 0.903965
85 0 4705894 5255408 0.895438
87 0 4568653 5111512 0.893797
89 0 4399648 5062632 0.869044
92 0 4248203 5079315 0.836373
95 0 4609376 5197709 0.886809
99 0 4386183 5041928 0.869942
102 0 4571898 5138060 0.88981
108 0 4535183 5138043 0.882667
114 0 4673614 5147947 0.90786
122 0 4565770 5029969 0.907713
126 0 4563826 5103692 0.89422
132 0 4550035 5098211 0.892477
133 0 4472361 5098635 0.877168
135 0 4533234 5255987 0.86249
137 0 4624282 5256017 0.879807
146 0 4680840 5099979 0.917816
147 0 4626446 5085570 0.90972
148 0 4601493 5099806 0.902288
149 0 4630542 5088968 0.909918
151 0 4658783 5089992 0.915283
153 0 4560742 5088968 0.896202
155 0 4674017 5285418 0.884323
159 0 4635390 5072410 0.913844
161 0 4657656 5060072 0.920472
163 0 4684726 5090382 0.920309
165 0 4675791 5088970 0.918809
167 0 4610326 5285512 0.872257
169 0 4702754 5285097 0.889814
171 0 4606436 5284087 0.871756
Finally, I also checked that in mapping mode -x
, the accuracy stayed the same (statistically) the randomness improved.
First, in the 50 genomes dataset the true highest and lowest number of reads simulated from the genomes are and 4,530 and 3,680, respectively.
Counting the number of reads mapping to each of the 50 genomes, I observed 8699 and 1974 reads mapped to highest and lowest genome for v0.11.0, respectively.
For this PR I observed 6913 and 2788, respectively.
So a big improvement but still quite far of the true values. However, we need to be careful not to get too disappointed here because statistical variation comes into play (too tired to derive how large it is, but the upper bound - all reads completely random - probably follows some common distribution).
However, I noticed a small change in accuracy, stats below. Is this expected?
Yes, that change is from #336.
Here is the change in accuracy just from this PR.
dataset | 8c74210 | 57fe7df | difference |
---|---|---|---|
drosophila-50-se | 82.5175 | 82.5108 | -0.0067 |
drosophila-75-se | 87.9797 | 88.033 | +0.0533 |
drosophila-100-se | 89.2626 | 89.255 | -0.0076 |
drosophila-150-se | 90.937 | 90.9448 | +0.0078 |
drosophila-200-se | 91.9463 | 91.9559 | +0.0096 |
drosophila-300-se | 93.2153 | 93.2187 | +0.0034 |
drosophila-500-se | 94.5722 | 94.5586 | -0.0136 |
maize-50-se | 47.4622 | 47.4417 | -0.0205 |
maize-75-se | 61.8938 | 61.9045 | +0.0107 |
maize-100-se | 70.5148 | 70.503 | -0.0118 |
maize-150-se | 81.1606 | 81.154 | -0.0066 |
maize-200-se | 86.7001 | 86.7083 | +0.0082 |
maize-300-se | 91.8516 | 91.8522 | +0.0006 |
maize-500-se | 95.3796 | 95.4038 | +0.0242 |
CHM13-50-se | 81.7231 | 81.7321 | +0.0090 |
CHM13-75-se | 88.9376 | 88.9318 | -0.0058 |
CHM13-100-se | 90.6468 | 90.6401 | -0.0067 |
CHM13-150-se | 92.4066 | 92.4081 | +0.0015 |
CHM13-200-se | 93.221 | 93.2167 | -0.0043 |
CHM13-300-se | 94.1508 | 94.1472 | -0.0036 |
CHM13-500-se | 95.0531 | 95.0499 | -0.0032 |
rye-50-se | 44.7328 | 44.7464 | +0.0136 |
rye-75-se | 60.236 | 60.2692 | +0.0332 |
rye-100-se | 69.3509 | 69.345 | -0.0059 |
rye-150-se | 80.4219 | 80.412 | -0.0099 |
rye-200-se | 85.9117 | 85.9279 | +0.0162 |
rye-300-se | 90.5181 | 90.5152 | -0.0029 |
rye-500-se | 93.5172 | 93.5133 | -0.0039 |
drosophila-50-pe | 90.1814 | 90.17505 | -0.0063 |
drosophila-75-pe | 91.6055 | 91.62985 | +0.0243 |
drosophila-100-pe | 92.38555 | 92.37945 | -0.0061 |
drosophila-150-pe | 93.2124 | 93.1991 | -0.0133 |
drosophila-200-pe | 93.5107 | 93.5143 | +0.0036 |
drosophila-300-pe | 95.36815 | 95.37105 | +0.0029 |
drosophila-500-pe | 95.70805 | 95.673 | -0.0350 |
maize-50-pe | 71.48315 | 71.48905 | +0.0059 |
maize-75-pe | 82.1135 | 82.1222 | +0.0087 |
maize-100-pe | 87.1423 | 87.1474 | +0.0051 |
maize-150-pe | 91.68815 | 91.6918 | +0.0037 |
maize-200-pe | 92.9413 | 92.93575 | -0.0055 |
maize-300-pe | 96.71515 | 96.71725 | +0.0021 |
maize-500-pe | 97.2744 | 97.29125 | +0.0168 |
CHM13-50-pe | 90.648 | 90.64065 | -0.0073 |
CHM13-75-pe | 92.51265 | 92.51435 | +0.0017 |
CHM13-100-pe | 93.2113 | 93.22665 | +0.0154 |
CHM13-150-pe | 94.1508 | 94.1352 | -0.0156 |
CHM13-200-pe | 94.4201 | 94.4225 | +0.0024 |
CHM13-300-pe | 95.62845 | 95.63475 | +0.0063 |
CHM13-500-pe | 95.9551 | 95.97645 | +0.0213 |
rye-50-pe | 69.146 | 69.185 | +0.0390 |
rye-75-pe | 80.5645 | 80.59005 | +0.0255 |
rye-100-pe | 85.6234 | 85.59965 | -0.0237 |
rye-150-pe | 90.23185 | 90.22415 | -0.0077 |
rye-200-pe | 91.4807 | 91.47285 | -0.0079 |
rye-300-pe | 94.55785 | 94.56635 | +0.0085 |
rye-500-pe | 95.15445 | 95.16665 | +0.0122 |
Average difference se: +0.0028 Average difference pe: +0.0027
As I suggested by e-mail, I combined this PR with #360.
PR #360 was problematic because it would not allow the optimization that one can stop doing alignments if an exact match has been found. This is taken care of by NAM shuffling: Assuming that two exact matches also give the exact NAM score, we pick one of these exact matches randomly.
However, if the first match is not an exact one, we do compute alignments until the dropoff is reached or the maximum number of sites has been tried. So we do that already and can, without additional cost, do the exact version from #360.
This changes the samdiff output thus:
9643 reads were unmapped before and after
0 reads became mapped
0 reads became unmapped
80441 reads were mapped to same locus before and after
* 9716 reads were multimapper before and after, same alignment score (AS)
* 14 reads were multimapper before and after, better alignment score (AS)
* 4 reads were multimapper before and after, worse alignment score (AS)
* 182 reads changed in another way
100000 total reads
So instead of 7421 (see output pasted somewhere above), now 9716 multimappers are affected.
I also reproduced what you did on the 50 E. coli reference. I used different genomes and simulated 1 million reads, so the numbers are different (should be somewhat reproducible now, will document later). Here is the last column of bedtools genomecov
(for those rows where the second column is 0) plotted for the various program versions ("shuffle-more" is the mentioned change):
This is great!
Did you see any noticeable slowdown?
Would this approach work for PE?
Did you see any noticeable slowdown?
It did not become measurably slower.
Would this approach work for PE?
I think something similar could work; I will look into it.
So this PR now
I would like to suggest that we merge this PR with the three features above included and that I open a separate PR for adding random read pair selection for paired-end reads (so as not to make the discussion here too long).
Awesome! Approved to merge.
This is an alternative to #360 that works by shuffling the top NAMs that have the same score as the best one.
See #359
samdiff.py
outputs this when comparing to main:The high number of reads that get the same alignment score before and after is a sign that this works as intended. There are some reads that get a different score, but this is as expected and the price to pay for this faster variant that only shuffles the NAMs.
Accuracy changes in an unremarkable way (I ran this comparison only on a subset of the datasets):
Average difference se: -0.0015
We don’t have a way at the moment to test whether this results in the intended outcome, that is, whether locations for multimappers are less biased.