Squeeze a tiny bit more performance out of day 21

jdashton commented 2 years ago

Simon, thank you SO MUCH for sharing your code for 2021. It has been really eye-opening!

I've learned so much from the approach you take; I'm still trying to see problems more as you see them. I re-implemented your Day 21 solution in my own Ruby code, and was delighted that it can get under 600 ms. I then re-worked it to use various combinations of multi-threading, recursion and caching, and the fastest time I saw was close to 250 ms. That is for a recursive, multi-threaded, caching solution that runs in 49 threads when my system is otherwise very unoccupied. It's fun to see all 16 cores of my i9 at least show a blip.

You also inspired me to try applying the same techniques to your Rust code. I finally found a combination that consistently turns-in slightly faster times. I called it Day 26, to make it easy to run beside your original Day 21 code.

:: day21
generator: 16.936µs
part1: (8.871µs, 518418)
part2: (62.761465ms, 116741133558209)

:: day26
generator: 17.049µs
part1: (8.971µs, 518418)
part2: (50.943622ms, 116741133558209)

Here is the relevant code:

const ROLLS: [(u16, usize); 7] = [(3, 1), (4, 3), (5, 6), (6, 7), (7, 6), (8, 3), (9, 1)];

fn seven_possible_rolls(
    player: Player,
    other_player: Player,
    quantity: usize,
    cache: Arc<Mutex<[(usize, usize); 53714]>>,
) -> [usize; 2] {
    let state = State(player, other_player);
    let (mut wins_a, mut wins_b) = {
        let cache = cache.lock().unwrap();
        cache[state.pack()]
    };
    if wins_a == 0 && wins_b == 0 {
        for (roll, qty) in &ROLLS {
            let mut player = player;
            player.advance(*roll);

            if player.score() >= 21 {
                wins_a += qty;
                continue;
            }

            let [unv_a, unv_b] =
                seven_possible_rolls(other_player, player, *qty, Arc::clone(&cache));
            wins_a += unv_b;
            wins_b += unv_a;
        }
        {
            let mut cache = cache.lock().unwrap();
            cache[state.pack()] = (wins_a, wins_b);
        }
    }
    [wins_a * quantity, wins_b * quantity]
}

pub fn part2(input: &(u16, u16)) -> usize {
    let mut wins1 = 0;
    let mut wins2 = 0;
    let mut handles = vec![];

    let cache = Arc::new(Mutex::new([(0, 0); 0xd1d2]));

    for (roll, p1_hits) in &ROLLS {
        let mut player1 = Player::new(input.0);
        let player2 = Player::new(input.1);
        player1.advance(*roll);

        let cache = Arc::clone(&cache);
        handles.push(thread::spawn(move || {
            seven_possible_rolls(player2, player1, *p1_hits, cache)
        }));
    }
    for handle in handles {
        let [wins_a, wins_b] = handle.join().unwrap();
        wins1 += wins_a;
        wins2 += wins_b;
    }

    wins1.max(wins2)
}

With your experience, you may see further optimizations that I can't see yet.

Thank you again for all the motivation!

jdashton commented 2 years ago

And, to clarify, I first tried the Rust solution with 49 threads (one from each of Player 2's first round of rolls), but that tended to be about 8 ms slower than your straight-forward approach. When I moved to instead start 7 threads, one from each of Player 1's first round of rolls, that's when this solution became faster. Creating threads is costly, so this makes sense to me. Also, having 7 threads contending for the mutex should be much better than having 49 threads contending for it.

jdashton commented 2 years ago

And . . . I just checked the numbers when run with --release, and the straight-forward solution takes only one third the time of the recursive caching approach.

:: day21
generator: 1.136µs
part1: (444ns, 518418)
part2: (2.328331ms, 116741133558209)

:: day26
generator: 842ns
part1: (426ns, 518418)
part2: (7.983128ms, 116741133558209)

Optimizations rock!

vodik / aoc

Squeeze a tiny bit more performance out of day 21 #1