rust-random / rand

A Rust library for random number generation.
https://crates.io/crates/rand

`choose_multiple_weighted` returns unexpected probabilities of results #1476

Open heavycharged opened 3 months ago

heavycharged commented 3 months ago

Summary

I am trying to choose N values from a vector with weights. All weights are calculated as score.powf(4.)

What I expected

I expect values to be chosen with frequency proportional to their weights. Here are the weights after the powf function:

value=1 has weight=0.0000409600
value=2 has weight=0.0000000037
value=3 has weight=0.0000000207

As you can see, value 2 has the lowest weight, so I expect it to be chosen least frequently.

What I got

got following results: {2: 100000, 3: 1, 1: 99999}

As you can see, value 2 was chosen 100k times, far more than value 3, which was chosen only once. Here are the weights of values 2 and 3:

2: 0.0000000037
3: 0.0000000207

Note

I noticed that this happens with very small floats. I also suspect that the initial order matters. Neither of these is mentioned in the documentation, so this is either a docs issue or a bug. I also cannot rule out that I missed something, so maybe I am wrong.

Code sample

use std::collections::HashMap;

use rand::{prelude::SliceRandom, thread_rng};

fn main() {
    let values: Vec<(i32, f64)> =
        vec![(1, 0.080), (2, 0.0078), (3, 0.012)]
            .into_iter()
            .map(|v| (v.0, f64::powf(v.1, 4.0)))
            .collect();

    for v in &values {
        println!("value={} has weight={:.10}", v.0, v.1);
    }

    let mut counts = HashMap::new();

    for _ in 0..100_000 {
        let winning: Vec<i32> = values
            .choose_multiple_weighted(&mut thread_rng(), 2, |a| a.1)
            .unwrap()
            .map(|v| v.0)
            .collect();

        for winner in winning {
            counts.entry(winner).and_modify(|v| *v += 1).or_insert(1);
        }
    }

    println!("got following results: {counts:?}");

    assert!(counts.get(&1).unwrap() > counts.get(&2).unwrap());
    assert!(counts.get(&1).unwrap() > counts.get(&3).unwrap());
    assert!(counts.get(&3).unwrap() > counts.get(&2).unwrap());
}
dhardy commented 3 months ago

I can reproduce this. More specifically, counting each ordered pair of chosen values (first choice, second choice) over 100,000 runs:

got following results:
(1, 1): 0
(1, 2): 0
(1, 3): 1
(2, 1): 0
(2, 2): 0
(2, 3): 96875
(3, 1): 3124
(3, 2): 0
(3, 3): 0
dhardy commented 3 months ago

The problem is in sample_efraimidis_spirakis (https://doi.org/10.1016/j.ipl.2005.11.003), which calculates key = rng.random::<f64>().powf(1.0 / weight); with the very small weights used here, the result usually rounds to 0.
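To illustrate the underflow, here is a minimal, self-contained sketch using value 2's weight from the report:

fn main() {
    // Weight of value 2 from the report above.
    let weight = 0.0000000037_f64;
    // A typical uniform sample from [0, 1).
    let r = 0.5_f64;
    // The exponent 1.0 / weight is ~2.7e8, so r.powf(1.0 / weight) is
    // far below f64's smallest subnormal and rounds to exactly 0.
    let key = r.powf(1.0 / weight);
    println!("key = {key}"); // prints: key = 0
}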

Note that our f64 random samples have limited precision. We have considered (and even implemented) algorithms for high-precision f64 values. This was not adopted since there is performance overhead and usually it's not necessary. Increasing precision could not be expected to solve this particular issue anyway.

More to the point, this algorithm does not appear to be suitable for the small weights used as inputs here. Pre-scaling the weights into a more suitable range may be a solution; however, a key point of the algorithm used is that it requires only a single iteration over the list of weights (which are, in this case, generated via a call to the weight function for each value). Without knowing the largest input weight in advance we would have to iterate twice over the weights (storing or regenerating them). At that point a different API may be better; or we could allow the user to input the maximum weight (or a weight-scaling function) as a hint.

For this particular example, a partial fix is to scale by the inverse of the largest weight in the weight fn: choose_multiple_weighted(&mut thread_rng(), 2, |a| a.1 / largest_weight). The results are better but still likely wrong since key often rounds to zero for values 2 and 3:

got following results:
(1, 1): 0
(1, 2): 6
(1, 3): 58
(2, 1): 5272
(2, 2): 0
(2, 3): 0
(3, 1): 94664
(3, 2): 0
(3, 3): 0
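For reference, a minimal sketch of this pre-scaling workaround applied to the original code sample; the extra pass to find largest_weight is exactly the second iteration over the weights discussed above:

use rand::{prelude::SliceRandom, thread_rng};

fn main() {
    let values: Vec<(i32, f64)> = vec![(1, 0.080), (2, 0.0078), (3, 0.012)]
        .into_iter()
        .map(|v| (v.0, f64::powf(v.1, 4.0)))
        .collect();

    // First pass: find the largest weight.
    let largest_weight = values.iter().map(|v| v.1).fold(f64::MIN, f64::max);

    // Second pass (inside the algorithm): scale so the largest weight is 1.0.
    let winning: Vec<i32> = values
        .choose_multiple_weighted(&mut thread_rng(), 2, |a| a.1 / largest_weight)
        .unwrap()
        .map(|v| v.0)
        .collect();

    println!("{winning:?}");
}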

@TheIronBorn have you got any suggestions? It seems that the algorithm is just a poor choice for small weights.

dhardy commented 3 months ago

We could keep this algorithm while documenting its limitations.

In this case, the best (or at least easiest) alternative may be to use WeightedIndex, re-sampling in the case that identical samples are returned (assuming that there are enough items available and likely to be selected, this should be reasonably efficient).
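A minimal sketch of that alternative, assuming the rand 0.8 module paths used in the original code sample (WeightedIndex lives under rand::distr in newer releases):

use std::collections::HashSet;

use rand::distributions::{Distribution, WeightedIndex};
use rand::thread_rng;

fn main() {
    let values = [(1, 0.080_f64), (2, 0.0078), (3, 0.012)];
    let weights: Vec<f64> = values.iter().map(|v| v.1.powf(4.0)).collect();

    // WeightedIndex samples in proportion to the weights' relative sizes,
    // so uniformly tiny weights are not a problem here.
    let dist = WeightedIndex::new(&weights).unwrap();
    let mut rng = thread_rng();

    // Re-sample until 2 distinct indices have been drawn.
    let mut chosen = HashSet::new();
    while chosen.len() < 2 {
        chosen.insert(dist.sample(&mut rng));
    }

    let winners: Vec<i32> = chosen.iter().map(|&i| values[i].0).collect();
    println!("{winners:?}");
}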

Choose-multiple-weighted is not a problem I've put much thought into; there are probably other algorithms.

dhardy commented 3 months ago

@teryror I see you have experience with this algorithm (#1365). Any thoughts here? Wikipedia suggests using ln(r) / w_i keys instead.
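For illustration, a rough sketch of that variant (not the library's implementation): since ln is monotonic and ln(r^(1/w)) = ln(r)/w, the keys ln(r)/w_i produce the same ordering as r^(1/w_i) but cannot underflow. This uses rand 0.8's gen (renamed random in later versions):

use rand::{thread_rng, Rng};

fn main() {
    let values = [(1, 0.080_f64), (2, 0.0078), (3, 0.012)];
    let mut rng = thread_rng();

    // Key ln(r)/w_i instead of r^(1/w_i): identical ordering, no underflow.
    // The largest keys win, exactly as in Efraimidis-Spirakis.
    let mut keyed: Vec<(i32, f64)> = values
        .iter()
        .map(|&(v, s)| (v, rng.gen::<f64>().ln() / s.powf(4.0)))
        .collect();

    // Select the 2 items with the largest keys.
    keyed.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    let winners: Vec<i32> = keyed.iter().take(2).map(|&(v, _)| v).collect();
    println!("{winners:?}");
}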