SIMD support (AVX edition) - Githubissues

o-jill commented 2 years ago

With AVX, eight values can be computed in one operation, so this should run at double speed.

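Problem: converting from a bitboard (or byteboard) to f32 x 8. If the 128-bit barrier can be crossed cleanly, this should be fast. Setting the upper and lower halves separately in advance and carrying them through to the f32 conversion should work.

Function core::arch::x86_64::_mm256_set_epi64x https://doc.rust-lang.org/beta/core/arch/x86_64/fn._mm256_set_epi64x.html

Note that unpack and similar operations are AVX2, not AVX.

Open question: is it necessary to specify avx as a target feature?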
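On the open question about target features, here is a minimal sketch of one common pattern: mark the SIMD function with `#[target_feature(enable = "avx")]` and guard the call with runtime detection, so the intrinsics can be used without setting RUSTFLAGS. The names `add8`, `add8_avx`, and `add8_scalar` are hypothetical and not from ruversi. Note that an intrinsic may not be inlined into a caller that lacks the matching feature, which could be one source of slowdowns when AVX is not enabled for the surrounding code.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Adds two 8-element f32 arrays with a single 256-bit AVX addition.
///
/// # Safety
/// The caller must make sure the CPU supports AVX.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn add8_avx(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    unsafe {
        // Unaligned loads and store: no alignment requirement on the arrays.
        let va = _mm256_loadu_ps(a.as_ptr());
        let vb = _mm256_loadu_ps(b.as_ptr());
        _mm256_storeu_ps(out.as_mut_ptr(), _mm256_add_ps(va, vb));
    }
    out
}

/// Portable fallback used when AVX is not available.
fn add8_scalar(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    std::array::from_fn(|i| a[i] + b[i])
}

/// Safe entry point: picks the AVX path only if the running CPU has AVX.
pub fn add8(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // Sound because AVX support was just checked at runtime.
            return unsafe { add8_avx(a, b) };
        }
    }
    add8_scalar(a, b)
}
```

Calling `add8(&[1.0; 8], &[2.0; 8])` returns `[3.0; 8]` on either path. The runtime check costs a little on every call, so in a hot loop it is usually hoisted out so that it runs once.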

o-jill commented 2 years ago

Converting 1 bit x 8 squares -> i32 x 8 looks something like this:

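    #include <immintrin.h>  /* AVX2 intrinsics; build with e.g. -mavx2 */
    #include <stdio.h>

    int main(void) {
        /* ret was not declared in the original snippet; a 64-byte buffer is assumed.
           _mm256_store_si256 needs a 32-byte-aligned destination. */
        _Alignas(32) unsigned char ret[64];
        unsigned long long int pos1 = 0x0706050403020100;
        {
            /* Low 4 bytes go to the lower 128-bit lane, high 4 bytes to the upper lane. */
            __m256i src1 = _mm256_set_epi64x(0, (pos1 >> 32), 0, pos1 & 0xffffffffu);
            __m256i zero = _mm256_setzero_si256();
            __m256i src2 = _mm256_unpacklo_epi8(src1, zero);   /* i8 -> i16 (AVX2) */
            __m256i src3 = _mm256_unpacklo_epi16(src2, zero);  /* i16 -> i32 (AVX2) */
            _mm256_store_si256((__m256i*)ret, src2);
            _mm256_store_si256((__m256i*)(ret+32), src3);
        }
        printf("pos1: 0x%016llX\n", pos1);
        puts("avx i8 x16 -> i32 x8 (1~8)");
        for (int i = 0 ; i < 32 ; ++i) {
            printf("%02X ", ret[i]);
        }
        puts("\navx i8 x16 -> i32 x8 (9~16)");
        for (int i = 0 ; i < 32 ; ++i) {
            printf("%02X ", ret[i + 32]);
        }
        puts("");
        return 0;
    }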
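Since the unpack instructions above need AVX2, here is a sketch of an alternative that expands the 8 bits of one bitboard row into f32 x 8 using only AVX instructions. It is an illustration under stated assumptions, not the code used in ruversi: the function names are hypothetical, and bit 0 is assumed to map to the first output element.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Expands the 8 bits of `row` into eight f32 values (0.0 or 1.0), bit 0 first,
/// using only AVX (no 256-bit integer instructions from AVX2).
///
/// # Safety
/// The caller must make sure the CPU supports AVX.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn bits_to_f32x8_avx(row: u8) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    unsafe {
        // Broadcast the row into all eight 32-bit lanes.
        let v = _mm256_castsi256_ps(_mm256_set1_epi32(row as i32));
        // One single-bit mask per lane: 1, 2, 4, ..., 128.
        let masks = _mm256_castsi256_ps(_mm256_setr_epi32(1, 2, 4, 8, 16, 32, 64, 128));
        // Bitwise AND in the float domain is plain AVX; lane i keeps only bit i.
        let isolated = _mm256_castps_si256(_mm256_and_ps(v, masks));
        // Convert the integers (0 or 1 << i) to f32, then clamp to at most 1.0.
        let as_f32 = _mm256_cvtepi32_ps(isolated);
        let zero_or_one = _mm256_min_ps(as_f32, _mm256_set1_ps(1.0));
        _mm256_storeu_ps(out.as_mut_ptr(), zero_or_one);
    }
    out
}

/// Portable fallback with the same result.
fn bits_to_f32x8_scalar(row: u8) -> [f32; 8] {
    std::array::from_fn(|i| ((row >> i) & 1) as f32)
}

/// Safe entry point that uses AVX only when the running CPU supports it.
pub fn bits_to_f32x8(row: u8) -> [f32; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // Sound because AVX support was just checked at runtime.
            return unsafe { bits_to_f32x8_avx(row) };
        }
    }
    bits_to_f32x8_scalar(row)
}
```

For a 64-bit bitboard `board`, row `r` would be `(board >> (8 * r)) as u8`. The broadcast-and-mask approach stays inside one 256-bit register, so the 128-bit lane boundary never comes into play.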

o-jill commented 2 years ago

Implemented it with the bitboard for now, but:

It runs at one third of the speed.
The search results (evaluation and node count) differ.

o-jill commented 2 years ago

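Made an 8-wide version: one third of the speed. Made a 16-wide version: one third of the speed. The evaluation value (mid-calculation) sometimes differs by about 2. Is it the order of operations? Adding RUSTFLAGS="-C target-feature=+avx" brings it up to about two thirds of the speed.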
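On the order-of-operations hypothesis: floating-point addition is not associative, so a lane-wise SIMD sum can legitimately differ from a left-to-right scalar sum. The sketch below uses deliberately extreme, made-up inputs to make the effect visible. It does not reproduce the evaluation function, and whether this explains the difference of about 2 seen here is not established.

```rust
/// Sums left to right, the way a plain scalar loop does.
fn sum_sequential(xs: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for &x in xs {
        acc += x;
    }
    acc
}

/// Sums with eight independent accumulators (one per SIMD lane) and then
/// adds the lanes together, mimicking an 8-wide AVX reduction.
fn sum_lanes8(xs: &[f32]) -> f32 {
    let mut lanes = [0.0f32; 8];
    for chunk in xs.chunks(8) {
        for (lane, &x) in lanes.iter_mut().zip(chunk) {
            *lane += x;
        }
    }
    sum_sequential(&lanes)
}

/// Hypothetical inputs: fourteen 1.0 values plus a large pair that cancels.
fn demo_inputs() -> [f32; 16] {
    let mut xs = [1.0f32; 16];
    xs[0] = 1.0e8;
    xs[8] = -1.0e8;
    xs
}
```

With these inputs the sequential sum is 7.0, because each 1.0 added to 1.0e8 is rounded away, while the lane-wise sum is 14.0, because the large pair cancels inside lane 0 before anything is lost. Real weights are far less extreme, so the expected differences are much smaller.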

o-jill commented 2 years ago

It segfaults depending on the environment. The address alignment appears to be wrong.

o-jill commented 2 years ago

How to check the address of a variable:

fn print_info(mem: &[u8]) {
    let addr = (&mem[0] as *const u8) as u64;
    let mut bound: u64 = 1;
    while addr & bound == 0 { bound <<= 1; }
    println!("size: {:>10}  addr: 0x{:>012x}  bound: {:>7}", mem.len(), addr, bound);
}

o-jill commented 2 years ago

Gave it a quick try. It looks like the array sometimes ends up on only a 16-byte boundary.

        let mut sigmo : [f32 ; N_HIDDEN] = [0.0 ; N_HIDDEN];
        let addr = (&sigmo[0] as *const f32) as u64;
        let mut bound : u64 = 1;
        while (addr & bound) == 0 {bound <<= 1;}
        println!("sigmo: {:X}, bound:{}", addr, bound);

     Running `target/release/ruversi --duel --ev1 evaltable.txt --ev2 evaltable.txt.old`
Hello, reversi world!
unknown option: target/release/ruversi
read eval table: evaltable.txt
sigmo: 7FFC583E1E30, bound:16
Segmentation fault (core dumped)

o-jill commented 2 years ago

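After changing _mm256_load_ps to _mm256_loadu_ps it works for now, and it appears to run at the same speed or faster. (Measured on a VM on a 9700K; this may depend on the CPU.)
Allocating memory with a specific address alignment apparently takes some fiddling, so it may be better to give up on that for the time being.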

How to do page-aligned memory allocation in Rust https://qiita.com/moriai/items/67761b3c0d83da3b6bb5
How to do page-aligned memory allocation in Rust without using alloc https://qiita.com/blackenedgold/items/823ab427477e37995ee6
Extend the existing #[repr] attribute on structs with an align = "N" option to specify a custom alignment for struct types. https://rust-lang.github.io/rfcs/1358-repr-align.html
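As a sketch of the `#[repr(align)]` route from the RFC linked above, a small wrapper type can force 32-byte alignment on an f32 array, which is what `_mm256_load_ps` and `_mm256_store_ps` require. The type name `Aligned32` and the helper `lowest_set_bit` are hypothetical; the helper mirrors the `bound` loop used earlier in this thread.

```rust
/// Wrapper that forces 32-byte alignment on the contained array.
#[repr(C, align(32))]
pub struct Aligned32<const N: usize>(pub [f32; N]);

/// Returns the largest power of two that divides `addr`
/// (the same value the `bound` loop computes). `addr` must be non-zero.
fn lowest_set_bit(addr: u64) -> u64 {
    1u64 << addr.trailing_zeros()
}

/// Returns the address of the first element of an `Aligned32` buffer.
fn address_of<const N: usize>(buf: &Aligned32<N>) -> u64 {
    buf.0.as_ptr() as u64
}
```

For the address in the log above, `lowest_set_bit(0x7FFC583E1E30)` is 16, which matches the printed `bound:16`. For any `Aligned32` value, on the stack or in a `Box`, `lowest_set_bit(address_of(&buf))` is at least 32, so the aligned load and store intrinsics would be usable on it.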

o-jill commented 2 years ago

Should _mm256_store_ps also be changed to _mm256_storeu_ps?

o-jill commented 2 years ago

Changed load/store to the u-suffixed (unaligned) versions.
The bug in the 16-wide version was a typo in a shift amount.
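For reference, a minimal sketch of the u-suffixed load/store pattern. `_mm256_loadu_ps` and `_mm256_storeu_ps` accept any address, so they are safe to use on an ordinary `[f32; N]` or on a slice that starts in the middle of a buffer. The function names are hypothetical, and scaling by a constant stands in for the real computation.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Multiplies the first eight elements of `src` by `k` using unaligned AVX access.
///
/// # Safety
/// The caller must make sure the CPU supports AVX and that `src.len() >= 8`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn scale8_avx(src: &[f32], k: f32) -> [f32; 8] {
    let mut dst = [0.0f32; 8];
    unsafe {
        // loadu/storeu work for any address, aligned or not.
        let v = _mm256_loadu_ps(src.as_ptr());
        _mm256_storeu_ps(dst.as_mut_ptr(), _mm256_mul_ps(v, _mm256_set1_ps(k)));
    }
    dst
}

/// Safe entry point with a scalar fallback.
pub fn scale8(src: &[f32], k: f32) -> [f32; 8] {
    assert!(src.len() >= 8, "need at least eight elements");
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // Sound: AVX was detected and the length was checked above.
            return unsafe { scale8_avx(src, k) };
        }
    }
    std::array::from_fn(|i| src[i] * k)
}
```

Passing `&buf[1..]` of a nine-element buffer starts the load 4 bytes past the start of the allocation, so the address cannot be a multiple of 32 unless the buffer itself is misaligned by exactly that amount; the unaligned intrinsics handle either case.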

o-jill commented 2 years ago

On a VM on a 9700K:

RUSTFLAGS="-C target-feature=+avx" cargo run --release --features avx 545msec/1979891nodes
RUSTFLAGS="-C target-feature=+avx" cargo run --release --features bitboard 373msec/1979891nodes
RUSTFLAGS="-C target-feature=+avx" cargo run --release 260msec/1979891nodes
RUSTFLAGS="-C target-cpu=native" cargo run --release --features avx 231msec/1979891nodes (=8570nodes/msec)
RUSTFLAGS="-C target-cpu=native" cargo run --release --features bitboard 236msec/1979891nodes
RUSTFLAGS="-C target-cpu=native" cargo run --release 255msec/1979891nodes

o-jill commented 2 years ago

Apparently adding -C target-cpu=native is the best approach. On an 8265U as well, SSE ≒ AVX in speed.
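To see what a given RUSTFLAGS setting actually enables, the compile-time feature set can be inspected with `cfg!(target_feature = "...")`. The sketch below checks a few features chosen only for illustration; the function name is hypothetical. Keep in mind that a binary built with `-C target-cpu=native` may not run on CPUs older than the build machine.

```rust
/// Lists which of a few x86 SIMD features were enabled at compile time
/// (via `-C target-feature=...` or `-C target-cpu=...`).
pub fn compiled_simd_features() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    if cfg!(target_feature = "sse2") {
        enabled.push("sse2");
    }
    if cfg!(target_feature = "avx") {
        enabled.push("avx");
    }
    if cfg!(target_feature = "avx2") {
        enabled.push("avx2");
    }
    if cfg!(target_feature = "fma") {
        enabled.push("fma");
    }
    enabled
}
```

Printing the result once at startup (for example with `println!("{:?}", compiled_simd_features())`) makes it easy to confirm which build configuration a benchmark run used.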

o-jill commented 2 years ago

AVX versions of the hidden-layer computation and of sigmoid (exp?) will wait until the hidden layer has eight or more units.

o-jill / ruversi

SIMD support (AVX edition) #47