Is bitboard any good?

o-jill commented 2 years ago

Is there any benefit to switching from byteboard to bitboard?

Two planes x 8x8 bits = 16 bytes < 64 bytes
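The size comparison can be checked with a minimal sketch. The struct names and fields here are hypothetical, not the project's actual `Board`, and only the cell storage is counted (side to move, pass count and so on are left out).

```rust
use std::mem::size_of;

/// Byteboard sketch: one signed byte per square (hypothetical layout).
#[allow(dead_code)]
pub struct ByteBoard {
    pub cells: [i8; 64],
}

/// Bitboard sketch: one 64-bit plane per colour (hypothetical layout).
#[allow(dead_code)]
pub struct BitBoard {
    pub black: u64,
    pub white: u64,
}

/// Returns (byteboard size, bitboard size) in bytes.
pub fn board_sizes() -> (usize, usize) {
    (size_of::<ByteBoard>(), size_of::<BitBoard>())
}
```

Calling `board_sizes()` gives the 64-byte and 16-byte figures quoted above.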

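For a start, it looks like the evaluation (the SIMD computation) has to change. The hash computation for the transposition table seems to need changing too. So do flipping stones and the screen display.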

o-jill commented 2 years ago

Maybe the hash computation does not need to care about the side to move? If it really has to, how about converting blank squares from 00 to 11?
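One possible reading of the 00-to-11 idea, as a sketch only: it assumes two planes (`black`, `white`) in which a square is never set in both, so the pattern 11 is free to mean "blank, other side to move". The function names are hypothetical, and `DefaultHasher` merely stands in for whatever hash the transposition table really uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Builds the pair of planes that is fed to the hash.
/// When `white_to_move` is true, blank squares (00) are rewritten as 11.
pub fn key_planes(black: u64, white: u64, white_to_move: bool) -> (u64, u64) {
    let blank = !(black | white);
    if white_to_move {
        (black | blank, white | blank)
    } else {
        (black, white)
    }
}

/// Hashes the position, including the side to move via `key_planes`.
pub fn position_hash(black: u64, white: u64, white_to_move: bool) -> u64 {
    let mut h = DefaultHasher::new();
    key_planes(black, white, white_to_move).hash(&mut h);
    h.finish()
}
```

With this encoding a completely full board yields the same key for both sides, since there is no blank square left to rewrite.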

When doing the SIMD computation in the evaluation, what is a good way to convert 1 bit into an i32 (or f32)?
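A scalar baseline for that conversion, as a sketch: it assumes bit `i` of each plane describes square `i`, and that black maps to +1.0 and white to -1.0 (both are assumptions, not taken from the repository). An explicit SIMD version would produce the same 64 values, and the compiler may vectorize this loop on its own.

```rust
/// Expands two bit planes into 64 floats: +1.0 black, -1.0 white, 0.0 blank.
/// Bit i of each plane is assumed to describe square i.
pub fn planes_to_f32(black: u64, white: u64) -> [f32; 64] {
    let mut out = [0.0f32; 64];
    for (i, cell) in out.iter_mut().enumerate() {
        // Pull out bit i of each plane as 0 or 1.
        let b = ((black >> i) & 1) as i32;
        let w = ((white >> i) & 1) as i32;
        *cell = (b - w) as f32;
    }
    out
}
```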

o-jill commented 2 years ago

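Implemented the bare minimum of bitboard. 45d468950 to_id_simd() is not done yet. At the very least, weight.rs and node.rs need to be adapted. It might be better to make a separate bitboard version of node.rs. There are a lot of places that say board::Board. It is just a search-and-replace job, really.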

o-jill commented 2 years ago

Built it. 9fb9720b9e9 It takes 1.5 times as long as byteboard.

eval: loading the board looks questionable.
reverse, checkreverse
fixedstone

o-jill commented 2 years ago

eval_simd looks slow. With nosimd there is not much difference between bitboard and byteboard.

o-jill commented 2 years ago

Swapping X and Y in the bit layout looks like it would make SIMD easy.
Current: MSB a1, a2, a3, ... h6, h7, h8 LSB
XY swapped: MSB h8, g8, f8, ... d1, c1, b1, a1 LSB

With the swapped layout, extracting one row by ANDing with 0x8080808080808080u leaves one bit in each byte, so wouldn't SIMD be easier as well?
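The masking trick can be sketched without fixing the layout. The function below is hypothetical: it picks bit `k` of every byte of a plane and returns one 0/1 value per byte, which is the shape a byte-wise SIMD load would want. Which line of the board those eight squares form depends on the bit layout chosen.

```rust
const TOP_BITS: u64 = 0x8080_8080_8080_8080;

/// Picks bit `k` (0..=7) of every byte of `board`; one 0/1 value per byte.
pub fn bit_of_each_byte(board: u64, k: u32) -> [u8; 8] {
    // Move bit k of every byte to the top of its byte, then mask.
    let picked = (board << (7 - k)) & TOP_BITS;
    // Bring the surviving bits down to bit 0 so each byte is 0 or 1.
    (picked >> 7).to_le_bytes()
}
```

For example, `bit_of_each_byte(0x8000_0000_0000_0080, 7)` yields `[1, 0, 0, 0, 0, 0, 0, 1]`, with the least significant byte first.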

o-jill commented 2 years ago

Implemented the whole thing, but the number of evaluated nodes differs from byteboard. 2cb0d2b529

My guess is that reverse() is wrong. Let's add more test patterns. There is no test that flips stones from the lower left toward the upper right.
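A sketch of what such a diagonal test could exercise. `flips_in_direction` is a hypothetical stand-in, not the project's `reverse()`; it assumes square index `y * 8 + x` with `y = 0` as the top row, so "lower left to upper right" is the direction `(dx, dy) = (1, -1)`.

```rust
/// Stones flipped when `me` plays at (x, y), looking only in direction (dx, dy).
pub fn flips_in_direction(me: u64, opp: u64, x: i32, y: i32, dx: i32, dy: i32) -> u64 {
    let mut flipped = 0u64;
    let (mut cx, mut cy) = (x + dx, y + dy);
    while (0..8).contains(&cx) && (0..8).contains(&cy) {
        let bit = 1u64 << (cy * 8 + cx);
        if opp & bit != 0 {
            flipped |= bit; // candidate, still needs a closing stone
        } else if me & bit != 0 {
            return flipped; // closed by an own stone: candidates are real
        } else {
            return 0; // ran into a blank square
        }
        cx += dx;
        cy += dy;
    }
    0 // ran off the board without a closing stone
}
```

A test for the missing case would place a stone at the lower-left corner `(0, 7)`, opponent stones at `(1, 6)` and `(2, 5)`, an own stone at `(3, 4)`, and expect exactly bits 49 and 42 to flip.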

nps is about even.

o-jill commented 2 years ago

Fixed. The speed is now at about the level of no significant difference.

byteboard:
val:-3.506 val:Some(-3.5055861), 2231980 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 512msec
val:-3.506 val:Some(-3.5055861), 2231980 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 524msec
val:-3.506 val:Some(-3.5055861), 2231980 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 504msec
val:-3.506 val:Some(-3.5055861), 2231980 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 513msec
val:-3.506 val:Some(-3.5055861), 2231980 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 517msec
val:-3.506 val:Some(-3.5055861), 2231980 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 491msec
val:-3.506 val:Some(-3.5055861), 2231980 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 474msec
val:-3.506 val:Some(-3.5055861), 2231980 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 479msec
val:-3.506 val:Some(-3.5055861), 2231980 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 515msec
val:-3.506 val:Some(-3.5055861), 2231980 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 501msec
val:-3.506 val:Some(-3.5055861), 2231980 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 496msec
502.4 ±15.29 (474 ~ 524) msec

bitboard:
val:-3.506 val:Some(-3.5055861), 2232105 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 517msec
val:-3.506 val:Some(-3.5055861), 2232105 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 483msec
val:-3.506 val:Some(-3.5055861), 2232105 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 512msec
val:-3.506 val:Some(-3.5055861), 2232105 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 507msec
val:-3.506 val:Some(-3.5055861), 2232105 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 484msec
val:-3.506 val:Some(-3.5055861), 2232105 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 537msec
val:-3.506 val:Some(-3.5055861), 2232105 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 486msec
val:-3.506 val:Some(-3.5055861), 2232105 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 514msec
val:-3.506 val:Some(-3.5055861), 2232105 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 518msec
val:-3.506 val:Some(-3.5055861), 2232105 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 484msec
val:-3.506 val:Some(-3.5055861), 2232105 nodes. @@d2[]e1@@f1[]b1@@a4[]a5@@a7 502msec
504 ±17.05 (483 ~ 537) msec
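The two summary lines are consistent with "mean ± population standard deviation" of the eleven timings, which the following sketch recomputes. The constant and function names are made up for this example; the timings are copied from the logs above.

```rust
/// Mean and population standard deviation of a sample.
pub fn mean_sd(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    (mean, var.sqrt())
}

/// Timings in msec, copied from the byteboard runs.
pub const BYTEBOARD_MS: [f64; 11] = [
    512.0, 524.0, 504.0, 513.0, 517.0, 491.0, 474.0, 479.0, 515.0, 501.0, 496.0,
];

/// Timings in msec, copied from the bitboard runs.
pub const BITBOARD_MS: [f64; 11] = [
    517.0, 483.0, 512.0, 507.0, 484.0, 537.0, 486.0, 514.0, 518.0, 484.0, 502.0,
];
```

The means differ by under 2 msec while each standard deviation is above 15 msec, which fits the "no significant difference" reading.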

o-jill commented 2 years ago

Measured with --duel. Is bitboard just slightly faster?

byteboard:
real    18m40.762s
user    30m44.370s
sys     0m8.785s
bitboard:
real    17m54.182s
user    29m28.839s
sys     0m8.000s

o-jill commented 2 years ago

Not applied to training yet.

forward
backward

Faster training would be welcome, but in the worst case it may not matter either way.

o-jill commented 2 years ago

For now I got something training-like to run. I have not checked whether it is equivalent to byteboard. Hopefully it is at least a little faster.

o-jill commented 2 years ago

The output is now the same.

byteboard: 0.077msec/file bitboard: 0.072msec/file

o-jill commented 2 years ago

Made bitboard the default. fac8bc3349

o-jill commented 2 years ago

The output still seems to differ somehow, so bitboard is no longer the default.

o-jill commented 2 years ago

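Generate boards from random numbers.
Compute with bitboard (avx, sse, nosimd) and byteboard (sse, nosimd).
Compare the outputs. For now, a difference of 1 or more probably means the computation is wrong.

Something for loading randomly generated boards (something that converts them to rfen) seems to be needed.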
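The board-generation half of that plan can be sketched with a direct byteboard-to-bitboard conversion and a round-trip check. Everything here is hypothetical: the xorshift generator only avoids an external crate, the mapping (1 to a black plane, -1 to a white plane, bit `i` for cell `i`) is an assumption, and the rfen step is skipped because its format is not described in this thread. As with any purely random fill, the boards need not be reachable positions.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crate.
/// The seed must be non-zero.
pub struct XorShift(pub u64);

impl XorShift {
    pub fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Random byteboard: each cell is -1, 0 or 1.
pub fn random_cells(rng: &mut XorShift) -> [i8; 64] {
    let mut cells = [0i8; 64];
    for c in cells.iter_mut() {
        *c = (rng.next_u64() % 3) as i8 - 1;
    }
    cells
}

/// Byteboard -> two bit planes (bit i = cell i; 1 assumed black, -1 white).
pub fn to_planes(cells: &[i8; 64]) -> (u64, u64) {
    let (mut black, mut white) = (0u64, 0u64);
    for (i, &c) in cells.iter().enumerate() {
        match c {
            1 => black |= 1u64 << i,
            -1 => white |= 1u64 << i,
            _ => {}
        }
    }
    (black, white)
}

/// Bit planes -> byteboard, the inverse of `to_planes`.
pub fn to_cells(black: u64, white: u64) -> [i8; 64] {
    let mut cells = [0i8; 64];
    for (i, c) in cells.iter_mut().enumerate() {
        *c = ((black >> i) & 1) as i8 - ((white >> i) & 1) as i8;
    }
    cells
}
```

Feeding the same random cells to both board types and comparing the evaluator outputs would then be the next step.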

o-jill commented 2 years ago

Generate a byteboard from random numbers. ~~rand needs to be added to cargo.toml.~~ <- It was already in use. How about generating a byteboard → rfen → bitboard?

[dependencies]
rand = "0.6"

extern crate rand;
use rand::distributions::{Distribution, Uniform};

fn main() {
    let mut rng = rand::thread_rng();
    // Each cell is drawn uniformly from {-1, 0, 1}.
    let die = Uniform::from(-1..=1);

    // 64 cells, one per square of the 8x8 board.
    let mut cells: [i8; 64] = [0; 64];
    for c in cells.iter_mut() {
        *c = die.sample(&mut rng);
    }
    println!("{:?}", cells);
}

o-jill commented 2 years ago

// Builds a Board directly from a cell array and the side to move.
fn from_array(cells: [i8; CELL_2D], tbn: i8) -> Board {
  Board {
    cells,
    teban: tbn,
    pass: 0,
  }
}

o-jill commented 2 years ago

As a simple approach, how about comparing the results of forward?

o-jill commented 2 years ago

forwardv3bb() was wrong. Fixed it.
evaluatev3bb_simdavx() was wrong. How about writing forwardv3bb_simdavx() and feeding its results back?

o-jill commented 2 years ago

Wrote forwardv3bb_simdavx(). It gets close, but the values differ by about 2 before the sigmoid. Is it the order of the additions?
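That f32 addition order can change a result is easy to demonstrate; whether it explains the gap of about 2 seen here is not established. The sketch below uses made-up values and hypothetical function names, and compares a left-to-right sum with a lane-wise sum of the kind a SIMD reduction performs.

```rust
/// Adds the values one after another, left to right.
pub fn sum_sequential(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0, |acc, &x| acc + x)
}

/// Adds the values into `LANES` interleaved partial sums and then combines
/// them, which is roughly the order a SIMD reduction uses.
pub fn sum_lanes<const LANES: usize>(xs: &[f32]) -> f32 {
    let mut lanes = [0.0f32; LANES];
    for (i, &x) in xs.iter().enumerate() {
        lanes[i % LANES] += x;
    }
    lanes.iter().fold(0.0, |acc, &x| acc + x)
}
```

For the input `[1e8, 1.0, -1e8, 1.0]` the sequential sum is 1.0, because `1e8 + 1.0` rounds back to `1e8` in f32, while the two-lane sum is 2.0.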

o-jill commented 2 years ago

On some CPUs the results seem to come out the same?
On some CPUs it seems to be slightly faster than byteboard?

ref: https://github.com/o-jill/ruversi/issues/47#issuecomment-1230948488

o-jill commented 2 years ago

As for speed, setting target-cpu=native appears to make it faster than byteboard. (target-cpu=native matters)
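`-C target-cpu=native` lets rustc assume every instruction-set extension of the build machine, which is visible at compile time through `cfg!(target_feature = ...)`. The helper below is a hypothetical diagnostic, not part of the project; it only reports which x86 SIMD level the binary was compiled for.

```rust
/// Reports the widest x86 SIMD level the compiler was allowed to assume.
pub fn compiled_simd_level() -> &'static str {
    if cfg!(target_feature = "avx2") {
        "avx2"
    } else if cfg!(target_feature = "avx") {
        "avx"
    } else if cfg!(target_feature = "sse2") {
        "sse2"
    } else {
        "none"
    }
}
```

Printing this once at startup makes it obvious when a build was made without the flag.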

Made bitboard the default again. 92b19d54

o-jill commented 2 years ago

Measured it, on a VM on a 9700K.

time RUSTFLAGS="-C target-cpu=native" cargo run --release --features avx -- --duel --ev1 evaltable.txt --ev2 evaltable.txt.old

real    10m3.107s
user    15m47.128s
sys     0m14.146s

time RUSTFLAGS="-C target-cpu=native" cargo run --release --features bitboard -- --duel --ev1 evaltable.txt --ev2 evaltable.txt.old

real    9m46.265s
user    15m48.323s
sys     0m6.639s

time RUSTFLAGS="-C target-cpu=native" cargo run --release -- --duel --ev1 evaltable.txt --ev2 evaltable.txt.old

real    11m10.496s
user    18m5.087s
sys     0m6.649s

o-jill / ruversi

Is bitboard any good? #38