AranKomat opened this issue 6 years ago
@AranKomat
> As I mentioned before, I'm working on applying AlphaZero to text generation using a decoder-only Transformer instead of a CNN. My implementation is nearly finished, but I haven't yet tested its performance on text generation.
Though I can't quite imagine how "applying AlphaZero to text generation" would work, it is very interesting.
> Besides, a Transformer can be used for board games like reversi, since each move can be represented as a symbol (for example, any reversi move can be encoded as a number from 0 to 63). Obviously this representation carries no geometric information, but it's interesting to see whether that information really matters compared with the speed advantage: layer-wise per-move FLOPS becomes roughly bs x 4 x hidden_dim^2 instead of bs x 8^2 x hidden_dim^2 x 3^2, which is 144x faster. Any questions?
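As a concrete check of the arithmetic in the quote above, here is a minimal sketch; the constants are illustrative, not measured, and the move encoding is just the row-major numbering the quote describes:

```python
def move_to_token(row, col, board_size=8):
    # Any reversi move maps to a symbol in [0, 63] on an 8x8 board.
    return row * board_size + col

def transformer_flops(bs, hidden_dim):
    # One new token per move: ~4 * hidden_dim^2 multiply-adds per layer
    # (the 4x factor comes from the feed-forward expansion).
    return bs * 4 * hidden_dim ** 2

def cnn_flops(bs, hidden_dim):
    # 8x8 board, 3x3 kernel, hidden_dim channels in and out.
    return bs * 8 ** 2 * hidden_dim ** 2 * 3 ** 2

bs, hidden_dim = 32, 256
print(move_to_token(3, 5))                                      # 29
print(cnn_flops(bs, hidden_dim) / transformer_flops(bs, hidden_dim))  # 144.0
```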
Very interesting. I also think the CNN could be replaced by Transformer attention, but I have not verified this properly. There is a possibility that it achieves the same performance with fewer layers and less computation. However, I think geometric information is very important in reversi, so without it the model might have difficulty becoming strong. It would be better to use attention that keeps geometric information for reversi (if possible).
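One way to "keep geometric information" in attention would be learned 2D positional embeddings over the board squares. A minimal PyTorch sketch of that idea (an assumption about the approach, not code from either repo):

```python
import torch
import torch.nn as nn

class Board2DPositionalEmbedding(nn.Module):
    """Learned row/column embeddings that restore board geometry to a
    sequence of 64 square tokens (illustrative sketch only)."""
    def __init__(self, hidden_dim, board_size=8):
        super().__init__()
        self.row = nn.Embedding(board_size, hidden_dim)
        self.col = nn.Embedding(board_size, hidden_dim)
        self.board_size = board_size

    def forward(self, x):
        # x: (bs, board_size**2, hidden_dim), squares in row-major order
        idx = torch.arange(self.board_size, device=x.device)
        pos = self.row(idx)[:, None, :] + self.col(idx)[None, :, :]
        return x + pos.reshape(1, -1, x.size(-1))
```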
> If you're interested, I'll notify you as soon as my implementation works, so that we can extract the components needed for your reversi project.
I am very interested in your work! Please let me know when you can.
How the text generation is done is explained in the readme of the repo I sent you an invitation to. At the moment, a single GPU (g3.4xlarge) processes 32 self-play games (usually about 30 moves per game) with 80 sims/move in 10 seconds. If there were no CPU bottleneck, I'm sure I could get that under 1 second. The CPU bottleneck is discussed in detail in an issue in my repo. Solving this bottleneck would help a symbol-based reversi AlphaZero, too.
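A common remedy for this kind of CPU bottleneck in AlphaZero self-play is to batch leaf evaluations from many concurrent games into a single GPU call. A hedged sketch of that pattern (the `model.predict` API and all names here are hypothetical, not the repo's actual code):

```python
import queue
import threading

class BatchedEvaluator:
    """Collects leaf-evaluation requests from many self-play workers and
    runs them through the network in one batch (illustrative sketch)."""
    def __init__(self, model, batch_size=32, timeout=0.005):
        self.model = model
        self.requests = queue.Queue()
        self.batch_size = batch_size
        self.timeout = timeout
        threading.Thread(target=self._loop, daemon=True).start()

    def evaluate(self, state):
        # Called from a worker thread; blocks until the batch is processed.
        done = threading.Event()
        slot = {}
        self.requests.put((state, slot, done))
        done.wait()
        return slot["policy"], slot["value"]

    def _loop(self):
        while True:
            batch = [self.requests.get()]  # block for at least one request
            while len(batch) < self.batch_size:
                try:
                    batch.append(self.requests.get(timeout=self.timeout))
                except queue.Empty:
                    break
            states = [b[0] for b in batch]
            policies, values = self.model.predict(states)  # hypothetical API
            for (_, slot, done), p, v in zip(batch, policies, values):
                slot["policy"], slot["value"] = p, v
                done.set()
```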
A Transformer that retains geometric information already exists, namely the Image Transformer, but I didn't implement it. While it would be interesting to try, verifying its superiority over a CNN would take too much time.
@AranKomat
Thank you very much for your invitation! I could guess what you are trying to do just from the repository name, and I was very surprised. I will look at it in detail later!