ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License

Do not count randstrobes, but estimate their number #314

Closed marcelm closed 9 months ago

marcelm commented 1 year ago

As discussed in #304.

ksahlin commented 1 year ago

For human, the overallocation with respect to expected density is 2 * (3,000,000,000)^0.5 * 16 bytes < 2 MB, so I think it won't be costly, and at the same time it is unlikely to be an underestimate.
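For concreteness, a quick sketch of that bound (the 2·sqrt(len) slack and the 16 bytes per randstrobe entry are the figures from the comment above; the function name is hypothetical):

```python
import math

def overallocation_bytes(genome_len: int, entry_bytes: int = 16) -> float:
    # Slack of 2*sqrt(len(ref)) extra randstrobe entries, each assumed
    # to occupy 16 bytes (figures taken from the discussion above).
    return 2 * math.sqrt(genome_len) * entry_bytes

# For a ~3 Gbp human genome, the slack stays under 2 MB:
print(overallocation_bytes(3_000_000_000) / 1e6)  # ≈ 1.75 (MB)
```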

marcelm commented 1 year ago

When posting the above to-do list, I had already seen that memory usage increased for CHM13 by ~700 MB, but as it turns out (see https://github.com/ksahlin/strobealign/pull/313#issuecomment-1599408339), this is actually from the XXH64 syncmer PR #313. This PR does not increase memory usage at all.

ksahlin commented 1 year ago

Could it be because -b tips over from 28 to 29? log2(600000000) = 29.16

marcelm commented 1 year ago

I tested this on a couple of randomly picked genomes. It seems the estimate is typically very close to the truth, but sometimes too low by a tiny bit.

| genome | size (bp) | actual no. of randstrobes | percentage of estimate |
| --- | ---: | ---: | ---: |
| phix | 5386 | 1008 | 88.96% |
| CHM13 | 3117292070 | 623250666 | 99.95% |
| drosophila | 143724894 | 28502129 | 99.07% |
| maize | 2182075994 | 435944291 | 99.87% |
| rye | 7287855879 | 1454354249 | 99.77% |
| GCF_000001735.4_TAIR10.1 | 119668634 | 23896908 | 99.75% |
| GCF_000002035.6_GRCz11 | 1679203469 | 335501420 | 99.87% |
| GCF_000182925.2_NC12 | 41102378 | 8214635 | 99.77% |
| GCF_000208745.1_Criollo_cocoa | 324879930 | 61211821 | 94.15% |
| GCF_003254395.2_Amel_HAv3.1 | 225250884 | 44822835 | 99.43% |
| GCF_015832195.1_Motacilla_alba_V1.0_pri | 1072670728 | 214263853 | 99.84% |
| GCF_020379485.1_Da-Ae | 1001886053 | 200287854 | 99.92% |
| GCF_020520425.1_BTI_SOV_V1 | 894736656 | 179085237 | 100.04% |
| GCF_024323335.1_CAAS_Psat_ZW6_1.0 | 3796066963 | 757907363 | 99.81% |
| GCF_024334085.1_ASM2433408v1 | 147838017 | 29550213 | 99.86% |
| GCF_026230105.1_icDioSubl1.1 | 456219552 | 91418967 | 100.14% |
| GCF_029448725.1_ASM2944872v1 | 2496185110 | 498778710 | 99.89% |
| GCF_900626175.2_cs10 | 876147649 | 147476086 | 84.13% |
| GCF_922984935.1_mMelMel3.1 | 2738694446 | 547924058 | 100.01% |
| GCF_947563725.1_qqArgBrue1.1 | 1778384155 | 355963536 | 100.06% |
ksahlin commented 1 year ago

Ah okay, since you measured that the 'extra constant' C has more or less no impact on memory compared to the XXH update, I guess we can increase this constant? How about 4*(len(ref))^0.5?

marcelm commented 1 year ago

To get all the percentages above below 100, we would need at least $9\cdot \sqrt{\text{len(ref)}}$. I’d go a bit higher just to be on the safe side.

The worst case is GCF_026230105.1_icDioSubl1.1, for which the actual no. of randstrobes is 0.19% higher than the estimate without $C$. An alternative is to just add 0.5% or 1% to the estimate, so something like `estimate = len(ref) / (k-s+1) * 1.01`.
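A quick sanity check of both variants against that worst case (numbers are from the table above; k=20 and s=16 are assumed default parameter values here, and the helper name is hypothetical):

```python
import math

def estimated_randstrobes(ref_len: int, k: int = 20, s: int = 16,
                          margin: float = 1.01) -> int:
    # Roughly one syncmer (hence one randstrobe) every k - s + 1 positions,
    # inflated by a 1% safety margin as suggested above.
    return int(ref_len / (k - s + 1) * margin)

# Worst case from the table: GCF_026230105.1_icDioSubl1.1
ref_len, actual = 456_219_552, 91_418_967

base = ref_len / (20 - 16 + 1)             # estimate without any constant C
shortfall = actual - base                  # ~0.19% / ~175k entries too few
print(shortfall <= 9 * math.sqrt(ref_len))      # an additive 9*sqrt covers it
print(actual <= estimated_randstrobes(ref_len)) # and so does the 1% margin
```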

ksahlin commented 1 year ago

Okay I see. The 1.01 multiplier sounds good.

marcelm commented 1 year ago

I measured how runtime changes for this PR. We omit one costly step at index generation time, but that step was parallelized previously, so the savings depend on how many threads were used. If I run main with 8 threads and compare, then this PR is between 10 and 15% faster. (If I run main with just one thread, then it’s more like 55% faster, but I don’t think that would be an honest comparison.)

Note that #307, which is an alternative approach for speeding up index generation, would give a 2X speedup.

ksahlin commented 1 year ago

Looks good, approved.

marcelm commented 9 months ago

I’m closing this PR because it conflicts with #307 (filling the randstrobes vector from multiple threads), which was merged already. We need exact randstrobe counts for #307.

I posted an alternative idea (which would need a new PR) in #340.