heavycharged opened 3 months ago
I can reproduce this. More specifically, I got the following results:
(1, 1): 0
(1, 2): 0
(1, 3): 1
(2, 1): 0
(2, 2): 0
(2, 3): 96875
(3, 1): 3124
(3, 2): 0
(3, 3): 0
The problem is in sample_efraimidis_spirakis (https://doi.org/10.1016/j.ipl.2005.11.003), which calculates key = rng.random::<f64>().powf(1.0 / weight). With the very small weights used here, 1.0 / weight is enormous, so the key usually rounds to 0.
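To illustrate the underflow with made-up numbers (this is not the library's code, just the same arithmetic):

```rust
fn main() {
    let u: f64 = 0.5; // a plausible uniform sample in [0, 1)
    for w in [1.0, 1e-2, 1e-4] {
        println!("weight {:e} -> key {:e}", w, u.powf(1.0 / w));
    }
    // weight 1e0  -> key 5e-1
    // weight 1e-2 -> key ~7.9e-31  (0.5^100)
    // weight 1e-4 -> key 0e0       (0.5^10000 underflows to exactly zero)
}
```

Every item whose key underflows ties at 0, so selection among those items no longer follows the weights.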
Note that our f64 random samples have limited precision. We have considered (and even implemented) algorithms for high-precision f64 values. This was not adopted since it adds performance overhead and is usually unnecessary; in any case, increasing precision could not be expected to solve this particular issue.
More to the point, this algorithm does not appear to be suitable for the small weights used as inputs here. Pre-scaling the weights into a more suitable range may be a solution; however, a key point of the algorithm used is that it requires only a single pass over the list of weights (which are, in this case, generated by calling the weight function for each value). Without knowing the largest input weight in advance, we would have to iterate over the weights twice (storing or regenerating them); a sketch of that two-pass approach follows below. At that point a different API may be better, or we could allow the user to supply the maximum weight (or a weight-scaling function) as a hint.
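For illustration, a minimal sketch of the two-pass ("store, then rescale") workaround, assuming rand 0.8-style paths; the item type and data here are made up:

```rust
use rand::seq::SliceRandom;

fn main() {
    // Hypothetical (value, weight) items; in the issue these weights come
    // from a weight function instead.
    let items: Vec<(u32, f64)> = vec![(1, 1e-12), (2, 3e-10), (3, 5e-8)];

    // First pass: find the largest weight.
    let largest_weight = items.iter().map(|a| a.1).fold(f64::NEG_INFINITY, f64::max);

    // Second pass: sample with weights rescaled into (0, 1].
    let chosen: Vec<_> = items
        .choose_multiple_weighted(&mut rand::thread_rng(), 2, |a| a.1 / largest_weight)
        .unwrap()
        .collect();
    println!("{chosen:?}");
}
```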
For this particular example, a partial fix is to scale by the inverse of the largest weight in the weight fn: choose_multiple_weighted(&mut thread_rng(), 2, |a| a.1 / largest_weight). The results are better but still likely wrong, since key often rounds to zero for values 2 and 3:
I got the following results:
(1, 1): 0
(1, 2): 6
(1, 3): 58
(2, 1): 5272
(2, 2): 0
(2, 3): 0
(3, 1): 94664
(3, 2): 0
(3, 3): 0
@TheIronBorn, have you got any suggestions? It seems that the algorithm is just a poor choice for small weights. We could keep this algorithm while documenting its limitations.
In this case, the best (or at least easiest) alternative may be to use WeightedIndex, re-sampling in the case that identical samples are returned (assuming enough items are available and likely to be selected, this should be reasonably efficient); see the sketch below.
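A sketch of that alternative, assuming rand 0.8 paths (choose_distinct_weighted is a hypothetical helper, not a rand API):

```rust
use rand::distributions::{Distribution, WeightedIndex};
use std::collections::HashSet;

/// Draw `amount` distinct indices, re-sampling on duplicates.
fn choose_distinct_weighted(weights: &[f64], amount: usize) -> Vec<usize> {
    assert!(amount <= weights.len());
    let dist = WeightedIndex::new(weights).expect("invalid weights");
    let mut rng = rand::thread_rng();
    let mut picked = HashSet::new();
    // Rejection loop: efficient as long as the already-selected items do
    // not dominate the total weight.
    while picked.len() < amount {
        picked.insert(dist.sample(&mut rng));
    }
    picked.into_iter().collect()
}

fn main() {
    // Tiny weights analogous to this issue.
    let weights = [1e-12, 3e-10, 5e-8];
    println!("{:?}", choose_distinct_weighted(&weights, 2));
}
```

Unlike the Efraimidis-Spirakis key, WeightedIndex only compares cumulative sums of the weights, so very small weights keep their relative proportions (no powf is involved).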
Choose-multiple-weighted is not a problem I've put much thought into; there are probably other algorithms.
Summary
I am trying to choose N values from a vector with weights. All weights are calculated as score.powf(4.).
What I expected
I expect values to be chosen in proportion to their weights. Here are the weights after the powf function: [weights screenshot not preserved]. As you can see, value 2 has the lowest weight, so I expect it to be chosen least frequently.
What I got
As you can see, value 2 was chosen 100k times, more than value 3, which was chosen only once. Here are the scores of values 2 and 3: [screenshot not preserved].
Note
I noticed that this happens with very small floats. I also suspect that the initial order matters. Neither of these is mentioned in the documentation, so this is either a docs issue or a bug. I don't exclude that I may have missed something, so maybe I am wrong.
Code sample
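The original sample was not preserved in this copy of the issue; below is a hypothetical reconstruction of the described setup (placeholder scores; 100,000 draws of 2 items weighted by score.powf(4.)):

```rust
use rand::seq::SliceRandom;
use std::collections::HashMap;

fn main() {
    // (value, score) pairs; the real scores were very small floats.
    let items: Vec<(u32, f64)> = vec![(1, 1e-3), (2, 2e-3), (3, 4e-3)];
    let mut counts: HashMap<(u32, u32), u32> = HashMap::new();
    let mut rng = rand::thread_rng();

    for _ in 0..100_000 {
        let chosen: Vec<u32> = items
            .choose_multiple_weighted(&mut rng, 2, |a| a.1.powf(4.))
            .unwrap()
            .map(|a| a.0)
            .collect();
        *counts.entry((chosen[0], chosen[1])).or_default() += 1;
    }

    let mut pairs: Vec<_> = counts.into_iter().collect();
    pairs.sort();
    for ((a, b), n) in pairs {
        println!("({a}, {b}): {n}");
    }
}
```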