poja / Cattus

Cattus is a chess engine based on the DeepMind AlphaZero paper, written in Rust. It uses a neural network to evaluate positions and MCTS as its search algorithm.

Serializer: use raw bytes instead of JSON #81

Closed barakugav closed 2 years ago

barakugav commented 2 years ago

The games directory easily exceeds 1 GB. Storing the raw bytes of the training data entries instead of JSON would save a lot of space.

poja commented 2 years ago

Protobuf could be a good tool for this

barakugav commented 2 years ago

I think protobuf is a bit overkill for this. Here is a simple implementation: https://github.com/poja/RL/pull/97

We may want to use protobuf to pass parameters from Python to Rust

barakugav commented 2 years ago

Apparently this doesn't decrease the file sizes... The new format has a fixed size of 8k (in chess), while the size of the JSON format depends on the data. In most positions the majority of moves are illegal, and an illegal move's probability is stored as -1, which is only two bytes. From what I saw, most of the files are around 6k.
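A rough back-of-the-envelope check of those two numbers, as a sketch only. It assumes the dense raw entry stores 18 `u64` planes, one `f32` per move for all 1880 moves, and an `i8` winner, and that JSON writes each illegal move as `-1` plus a separating comma. Neither layout is confirmed by the PR, and the function names are hypothetical.

```rust
// Sizes assumed for illustration; they are not taken from the PR itself.
const NUM_PLANES: usize = 18;
const NUM_MOVES: usize = 1880;

/// Size in bytes of the assumed dense raw entry:
/// planes as u64, one f32 probability per move, one i8 winner.
fn raw_dense_size() -> usize {
    NUM_PLANES * 8 + NUM_MOVES * 4 + 1
}

/// Bytes JSON spends on illegal moves when each one is written
/// as the two characters "-1" followed by a one-byte comma.
fn json_illegal_moves_size(illegal_moves: usize) -> usize {
    illegal_moves * 3
}
```

Under these assumptions `raw_dense_size()` is 7665 bytes, close to the 8k figure, while a position with 1850 illegal moves spends about 5550 bytes of JSON on them, which is in line with files of around 6k.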

Not sure what to do here. Either format is fine. If this becomes a real issue, we can change the format to the following (raw bytes, not JSON):

{ planes: [u64: 18], moves_bitboard: [u64: 30], probs: [f32: 256], winner: i8 }

Instead of storing the probabilities of all 1880 moves, we store a bitmap of 1880 bits that tells us which moves are included in the 'probs' array. This takes advantage of the fact that no more than 256 moves are ever legal in the same chess position. It results in a fixed-size entry of about 1.4k bytes (18·8 + 30·8 + 256·4 + 1 = 1409). This complicates things slightly, but not much, and it is very self-contained.
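A minimal sketch of how such an entry could be packed and unpacked, assuming little-endian encoding and assuming that illegal moves are marked by a negative probability in the dense array. The type and method names (`CompactEntry`, `from_dense`, `to_dense`, `to_bytes`) are hypothetical and are not taken from the repository.

```rust
const NUM_PLANES: usize = 18;
const NUM_MOVES: usize = 1880;
const MASK_WORDS: usize = 30; // 30 * 64 = 1920 bits, enough for 1880 moves
const MAX_LEGAL: usize = 256;

/// Serialized size: 18*8 + 30*8 + 256*4 + 1 = 1409 bytes.
const ENTRY_SIZE: usize = NUM_PLANES * 8 + MASK_WORDS * 8 + MAX_LEGAL * 4 + 1;

struct CompactEntry {
    planes: [u64; NUM_PLANES],
    moves_bitboard: [u64; MASK_WORDS],
    probs: [f32; MAX_LEGAL],
    winner: i8,
}

impl CompactEntry {
    /// Pack a dense probability array (negative value = illegal move).
    /// Returns None if the length is wrong or more than 256 moves are legal.
    fn from_dense(planes: [u64; NUM_PLANES], dense: &[f32], winner: i8) -> Option<Self> {
        if dense.len() != NUM_MOVES {
            return None;
        }
        let mut moves_bitboard = [0u64; MASK_WORDS];
        let mut probs = [0f32; MAX_LEGAL];
        let mut count = 0;
        for (idx, &p) in dense.iter().enumerate() {
            if p < 0.0 {
                continue; // illegal move: leave its bit unset
            }
            if count == MAX_LEGAL {
                return None;
            }
            moves_bitboard[idx / 64] |= 1u64 << (idx % 64);
            probs[count] = p; // probs are stored in increasing move-index order
            count += 1;
        }
        Some(Self { planes, moves_bitboard, probs, winner })
    }

    /// Expand back to the dense array, writing -1 for every unset bit.
    fn to_dense(&self) -> Vec<f32> {
        let mut dense = vec![-1.0f32; NUM_MOVES];
        let mut next = 0;
        for idx in 0..NUM_MOVES {
            if self.moves_bitboard[idx / 64] & (1u64 << (idx % 64)) != 0 {
                dense[idx] = self.probs[next];
                next += 1;
            }
        }
        dense
    }

    /// Serialize to a fixed-size little-endian byte buffer.
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(ENTRY_SIZE);
        for p in &self.planes {
            out.extend_from_slice(&p.to_le_bytes());
        }
        for w in &self.moves_bitboard {
            out.extend_from_slice(&w.to_le_bytes());
        }
        for p in &self.probs {
            out.extend_from_slice(&p.to_le_bytes());
        }
        out.push(self.winner as u8);
        out
    }
}
```

The bitmap costs 240 bytes per entry but removes the 1880-slot probability array, so the entry size no longer depends on how many moves are legal.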

barakugav commented 2 years ago

On second thought, let's try to merge this feature (the simple raw bytes, not the bitmap). In Hex this will have a bigger impact.

barakugav commented 2 years ago

https://github.com/poja/RL/pull/97/commits/5bc4d057b23c858069810c9773c7dd521a939cba 1280 bytes! Now that is more reasonable

poja commented 2 years ago

> We may want to use protobuf to pass parameters from Python to Rust

This is not the main subject of this thread, but still: I think textual formats are preferable where performance is not an issue (and I think that applies here).