ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License

Do not count randstrobes, but estimate their number #314

Closed marcelm closed 9 months ago

marcelm commented 1 year ago

As discussed in #304.

ksahlin commented 1 year ago

For human, the overallocation with respect to expected density is 2 * (3,000,000,000)^0.5 * 16 bytes < 2 MB, so I think it won't be costly, and at the same time it is unlikely to be an underestimate.
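For concreteness, a quick sketch of that bound (the 2·sqrt(len) slack and the 16 bytes per randstrobe entry are the figures from the comment above; the function name is hypothetical):

```python
import math

def overallocation_bytes(genome_len: int, entry_bytes: int = 16) -> float:
    # Slack of 2*sqrt(len(ref)) extra randstrobe entries, each assumed
    # to occupy 16 bytes (figures taken from the discussion above).
    return 2 * math.sqrt(genome_len) * entry_bytes

# For a ~3 Gbp human genome, the slack stays under 2 MB:
print(overallocation_bytes(3_000_000_000) / 1e6)  # ≈ 1.75 (MB)
```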

marcelm commented 1 year ago

When posting the above to-do list, I had already seen that memory usage increased for CHM13 by ~700 MB, but as it turns out (see https://github.com/ksahlin/strobealign/pull/313#issuecomment-1599408339), this is actually from the XXH64 syncmer PR #313. This PR does not increase memory usage at all.

ksahlin commented 1 year ago

Could it be because -b tips over from 28 to 29? log2(600000000) = 29.16

marcelm commented 1 year ago

I tested this on a couple of randomly picked genomes. It seems the estimate is typically very close to the truth, but sometimes too low by a tiny bit.

| genome | size (bp) | actual no. of randstrobes | percentage of estimate |
| --- | ---: | ---: | ---: |
| phix | 5386 | 1008 | 88.96% |
| CHM13 | 3117292070 | 623250666 | 99.95% |
| drosophila | 143724894 | 28502129 | 99.07% |
| maize | 2182075994 | 435944291 | 99.87% |
| rye | 7287855879 | 1454354249 | 99.77% |
| GCF_000001735.4_TAIR10.1 | 119668634 | 23896908 | 99.75% |
| GCF_000002035.6_GRCz11 | 1679203469 | 335501420 | 99.87% |
| GCF_000182925.2_NC12 | 41102378 | 8214635 | 99.77% |
| GCF_000208745.1_Criollo_cocoa | 324879930 | 61211821 | 94.15% |
| GCF_003254395.2_Amel_HAv3.1 | 225250884 | 44822835 | 99.43% |
| GCF_015832195.1_Motacilla_alba_V1.0_pri | 1072670728 | 214263853 | 99.84% |
| GCF_020379485.1_Da-Ae | 1001886053 | 200287854 | 99.92% |
| GCF_020520425.1_BTI_SOV_V1 | 894736656 | 179085237 | 100.04% |
| GCF_024323335.1_CAAS_Psat_ZW6_1.0 | 3796066963 | 757907363 | 99.81% |
| GCF_024334085.1_ASM2433408v1 | 147838017 | 29550213 | 99.86% |
| GCF_026230105.1_icDioSubl1.1 | 456219552 | 91418967 | 100.14% |
| GCF_029448725.1_ASM2944872v1 | 2496185110 | 498778710 | 99.89% |
| GCF_900626175.2_cs10 | 876147649 | 147476086 | 84.13% |
| GCF_922984935.1_mMelMel3.1 | 2738694446 | 547924058 | 100.01% |
| GCF_947563725.1_qqArgBrue1.1 | 1778384155 | 355963536 | 100.06% |
ksahlin commented 1 year ago

Ah okay, since you measured that the 'extra constant' C has more or less no impact on memory compared to the XXH update, I guess we can increase this constant? How about 4*(len(ref))^0.5?

marcelm commented 1 year ago

To get all the percentages above below 100, we would need at least $9\cdot \sqrt{\text{len(ref)}}$. I’d go a bit higher just to be on the safe side.

The worst case is GCF_026230105.1_icDioSubl1.1, for which the actual no. of randstrobes is 0.19% higher than the estimate without $C$. An alternative is to just add 0.5% or 1% to the estimate, so something like `estimate = len(ref) / (k-s+1) * 1.01`.
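A quick sanity check of both variants against that worst case (numbers are from the table above; k=20 and s=16 are assumed default parameter values here, and the helper name is hypothetical):

```python
import math

def estimated_randstrobes(ref_len: int, k: int = 20, s: int = 16,
                          margin: float = 1.01) -> int:
    # Roughly one syncmer (hence one randstrobe) every k - s + 1 positions,
    # inflated by a 1% safety margin as suggested above.
    return int(ref_len / (k - s + 1) * margin)

# Worst case from the table: GCF_026230105.1_icDioSubl1.1
ref_len, actual = 456_219_552, 91_418_967

base = ref_len / (20 - 16 + 1)             # estimate without any constant C
shortfall = actual - base                  # ~0.19% / ~175k entries too few
print(shortfall <= 9 * math.sqrt(ref_len))      # an additive 9*sqrt covers it
print(actual <= estimated_randstrobes(ref_len)) # and so does the 1% margin
```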

ksahlin commented 1 year ago

Okay I see. The 1.01 multiplier sounds good.

marcelm commented 1 year ago

I measured how runtime changes for this PR. We omit one costly step at index generation time, but that step was parallelized previously, so the savings depend on how many threads were used. If I run main with 8 threads and compare, then this PR is between 10 and 15% faster. (If I run main with just one thread, then it’s more like 55% faster, but I don’t think that would be an honest comparison.)

Note that #307, which is an alternative approach for speeding up index generation, would give a 2X speedup.

ksahlin commented 1 year ago

Looks good, approved.

marcelm commented 9 months ago

I’m closing this PR because it conflicts with #307 (filling the randstrobes vector from multiple threads), which was merged already. We need exact randstrobe counts for #307.

I posted an alternative idea (which would need a new PR) in #340.