For the human genome, the overallocation with respect to the expected density is $2\cdot\sqrt{3{,}000{,}000{,}000}\cdot 16\,\text{bytes} < 2\,\text{MB}$, so I think it won't be costly, and at the same time it is unlikely to be an underestimate.
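Spelled out, the arithmetic behind that bound:

$$2 \cdot \sqrt{3\times 10^9} \cdot 16\,\text{B} \approx 2 \cdot 54{,}772 \cdot 16\,\text{B} \approx 1.75\,\text{MB}$$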
When posting the above to-do list, I had already seen that memory usage increased for CHM13 by ~700 MB, but as it turns out (see https://github.com/ksahlin/strobealign/pull/313#issuecomment-1599408339), this is actually from the XXH64 syncmer PR #313. This PR does not increase memory usage at all.
Could it be because `-b` tips over from 28 to 29? log2(600000000) ≈ 29.16
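For illustration only — the 28→29 figures are taken from the comment above, and the way `-b` and the bucket-table entry size are treated here is a hypothetical sketch, not strobealign's actual implementation:

```python
import math

n_randstrobes = 600_000_000          # rough CHM13 count used in the log2 above
print(math.log2(n_randstrobes))      # ~29.16, i.e. just past the 2^29 boundary

# Hypothetically, if -b sets the number of bucket bits and the bucket table
# has 2**b entries, then one extra bit doubles that table:
entry_bytes = 4                      # placeholder entry size, not the real one
for b in (28, 29):
    print(f"b={b}: {(1 << b) * entry_bytes / 1e9:.2f} GB")
```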
I tested this on a couple of randomly picked genomes. It seems the estimate is typically very close to the truth, but sometimes too low by a tiny bit.
genome | size (bp) | actual no. of randstrobes | actual as % of estimate |
---|---|---|---|
phix | 5386 | 1008 | 88.96% |
CHM13 | 3117292070 | 623250666 | 99.95% |
drosophila | 143724894 | 28502129 | 99.07% |
maize | 2182075994 | 435944291 | 99.87% |
rye | 7287855879 | 1454354249 | 99.77% |
GCF_000001735.4_TAIR10.1 | 119668634 | 23896908 | 99.75% |
GCF_000002035.6_GRCz11 | 1679203469 | 335501420 | 99.87% |
GCF_000182925.2_NC12 | 41102378 | 8214635 | 99.77% |
GCF_000208745.1_Criollo_cocoa | 324879930 | 61211821 | 94.15% |
GCF_003254395.2_Amel_HAv3.1 | 225250884 | 44822835 | 99.43% |
GCF_015832195.1_Motacilla_alba_V1.0_pri | 1072670728 | 214263853 | 99.84% |
GCF_020379485.1_Da-Ae | 1001886053 | 200287854 | 99.92% |
GCF_020520425.1_BTI_SOV_V1 | 894736656 | 179085237 | 100.04% |
GCF_024323335.1_CAAS_Psat_ZW6_1.0 | 3796066963 | 757907363 | 99.81% |
GCF_024334085.1_ASM2433408v1 | 147838017 | 29550213 | 99.86% |
GCF_026230105.1_icDioSubl1.1 | 456219552 | 91418967 | 100.14% |
GCF_029448725.1_ASM2944872v1 | 2496185110 | 498778710 | 99.89% |
GCF_900626175.2_cs10 | 876147649 | 147476086 | 84.13% |
GCF_922984935.1_mMelMel3.1 | 2738694446 | 547924058 | 100.01% |
GCF_947563725.1_qqArgBrue1.1 | 1778384155 | 355963536 | 100.06% |
Ah okay, since you measured that the 'extra constant' $C$ has more or less no impact on memory compared to the XXH update, I guess we can increase this constant? How about $4\cdot\sqrt{\text{len(ref)}}$?
To bring all of the above percentages below 100%, we would need at least $9\cdot \sqrt{\text{len(ref)}}$. I’d go a bit higher just to be on the safe side.
The worst is GCF_026230105.1_icDioSubl1.1, for which the actual no. of randstrobes is 0.19% higher than the estimate without $C$. An alternative is to just add 0.5% or 1% to the estimate, so something like `estimate = len(ref) / (k-s+1) * 1.01`.
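A minimal sketch of that estimate in Python (not strobealign's actual code); the `k=20`, `s=16` defaults are assumptions here, since the real values depend on the read length:

```python
def estimate_randstrobes(ref_len: int, k: int = 20, s: int = 16) -> int:
    # One open syncmer (and hence one randstrobe) is expected roughly every
    # (k - s + 1) bases; the 1.01 factor adds the 1% safety margin
    # discussed above.
    return int(ref_len / (k - s + 1) * 1.01)

# Example with the CHM13 size from the table above:
print(estimate_randstrobes(3_117_292_070))  # ~629 M, vs. 623,250,666 actual
```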
Okay I see. The 1.01 multiplier sounds good.
I measured how runtime changes for this PR. We omit one costly step at index generation time, but that step was parallelized previously, so the savings depend on how many threads were used. If I run `main` with 8 threads and compare, then this PR is between 10 and 15% faster. (If I run `main` with just one thread, then it’s more like 55% faster, but I don’t think that would be an honest comparison.)
Note that #307, which is an alternative approach for speeding up index generation, would give a 2X speedup.
Looks good, approved.
I’m closing this PR because it conflicts with #307 (filling the randstrobes vector from multiple threads), which was merged already. We need exact randstrobe counts for #307.
I posted an alternative idea (which would need a new PR) in #340.
As discussed in #304.