mokemokechicken / reversi-alpha-zero

Reversi reinforcement learning by AlphaGo Zero methods.
MIT License

It may forget pertinent information about positions that it no longer visits. #38

Open apollo-time opened 6 years ago

apollo-time commented 6 years ago

I see that my model is no longer improving. Moreover, I found that, as ThomasWAnthony suggested, "It may forget pertinent information about positions that it no longer visits" when an unusual action is selected. @mokemokechicken, @gooooloo What do you think?

mokemokechicken commented 6 years ago

@apollo-time

I think that is a possibility, and if we want to keep improving the model, we need a larger sim_per_move and a larger self-play dataset.

I have a simple hypothesis that

So, I feel that gradually increasing sim_per_move and the dataset size is effective. (I think humans also do that to become professionals.)

apollo-time commented 6 years ago

I think a larger sim_per_move and self-play dataset can't resolve the "no longer visits" problem, because the unusual positions can't be selected by self-play MCTS. So I am trying to select a fully random action sometimes in self-play, and to ignore the history that precedes the random action.
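The idea in this comment could be sketched roughly as below. This is only an illustration under assumptions, not code from this repository: `self_play_with_random_moves`, `legal_moves`, `apply_move`, `choose_mcts_move`, and `random_move_prob` are hypothetical names, and "ignore previous history" is read here as discarding the training records collected before the random move.

```python
import random

def self_play_with_random_moves(initial_state, legal_moves, apply_move,
                                choose_mcts_move, random_move_prob=0.05, rng=None):
    """Play one game, occasionally replacing the MCTS move with a random one.

    Returns the (state, move) records kept for training. Whenever a random
    move is played, every record collected so far is dropped, so only the
    positions after the last random move are learned from.
    """
    rng = rng or random.Random()
    state = initial_state
    records = []
    while True:
        moves = legal_moves(state)
        if not moves:  # assumed convention: no legal moves means the game is over
            break
        if rng.random() < random_move_prob:
            move = rng.choice(moves)   # fully random action
            records = []               # ignore the history before it
        else:
            move = choose_mcts_move(state)
            records.append((state, move))
        state = apply_move(state, move)
    return records
```

The random move itself is not recorded, since it does not reflect the search policy; whether that is the right choice is a design question the thread leaves open.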

AranKomat commented 6 years ago

@mokemokechicken I asked @gooooloo a similar question in another thread, but what is the default number of games per gradient update in your algorithm? I guess this ratio is important for performance, since it behaves like sims/move, which is undoubtedly important.

mokemokechicken commented 6 years ago

@AranKomat

what is the default number of games per gradient update in your algorithm?

I do not know which number to answer concretely, but the resulting speed is as follows.

setting

speed

so

Maybe this means that each position is learned 68 times, regardless of (nb_game_in_file, max_file_num).

AranKomat commented 6 years ago

Thanks for your answer. In the case of Go with AlphaZero, 700k minibatches (2048 positions each) and 21 million self-play games were performed. Assuming that each game ended with 150 stones (positions) placed, 700k x 2048/(21m x 150)=0.44 [trained position]/[self-play-generated position], which is much less than 68. So, I guess you can improve your performance with more self-plays per update. Maybe the performance gain by increasing the sims/move from 100 to 800 was because you had a small self-play/training ratio, that is, you had too little exploration. Since generating more games yields more diverse data than running more sims/move, spending more time on self-play may be more beneficial than more sims/move. But in practice, since your algorithm doesn't allow multi-processing (of multiple games) as Akababa's does, my suggestion may not be useful. But this may be useful for @gooooloo.
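The estimate above can be reproduced in a few lines. The function name is hypothetical, and the 150 positions per game is the commenter's assumption. With these inputs the expression comes out near 0.455, in line with the roughly 0.44 quoted in the comment.

```python
def trained_per_generated(minibatches, batch_size, games, positions_per_game):
    """Trained positions divided by self-play-generated positions."""
    return (minibatches * batch_size) / (games * positions_per_game)

# Figures quoted in the comment for AlphaZero on Go.
ratio = trained_per_generated(700_000, 2048, 21_000_000, 150)
print(f"{ratio:.3f}")  # prints 0.455
```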

mokemokechicken commented 6 years ago

@AranKomat

I guess you can improve your performance with more self-plays per update.

I think so too. In my environment, although GPU usage is already 100% (from self-play and training), implementing multiprocess self-play will increase the number of self-play games relative to training.

So I am planning to implement multiprocess self-play. However, I am still considering whether it will really work with the present method.

mokemokechicken commented 6 years ago

I am testing on feature/multiprocess_selfplay,

when 16 parallel in self-play,

so

AranKomat commented 6 years ago

Cool. So, multi-processing successfully decreased the ratio and achieved 36s per game under 400 sims/move. Now, it suffices to elucidate the trade-off between training/selfplay ratio and sims/move. I'm excited for your subsequent announcements!

mokemokechicken commented 6 years ago

I also added a wait to the optimizer to change the ratio.

Now,

so

gooooloo commented 6 years ago

@AranKomat

Mine is:

I actually don't understand the number below that @mokemokechicken mentioned:

400 positions per 1 self-play game

But if I just use this number, then my self-play speed is 80 positions per second (=400/5), and my Training/SelfPlay Ratio is 5.3 (=426/80).
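As a sketch of the arithmetic in this comment: the figures behind "Mine is:" are not shown, so the code below assumes that 426 is the training speed in positions per second and 5 is the number of seconds per self-play game. The function name is hypothetical.

```python
def training_selfplay_ratio(trained_positions_per_sec, positions_per_game, seconds_per_game):
    """Trained positions per second divided by self-play positions per second."""
    selfplay_positions_per_sec = positions_per_game / seconds_per_game
    return trained_positions_per_sec / selfplay_positions_per_sec

# Assumed meaning of the quoted numbers: 426 positions/s trained, 400 positions/game, 5 s/game.
ratio = training_selfplay_ratio(426, 400, 5)
print(f"{ratio:.1f}")  # prints 5.3
```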

gooooloo commented 6 years ago

Thanks for your answer. In the case of Go with AlphaZero, 700k minibatches (2048 positions each) and 21 million self-play games were performed. Assuming that each game ended with 150 stones (positions) placed, 700k x 2048/(21m x 150)=0.44 [trained position]/[self-play-generated position], which is much less than 68. So, I guess you can improve your performance with more self-plays per update. Maybe the performance gain by increasing the sims/move from 100 to 800 was because you had a small self-play/training ratio, that is, you had too little exploration. Since generating more games yields more diverse data than running more sims/move, spending more time on self-play may be more beneficial than more sims/move. But in practice, since your algorithm doesn't allow multi-processing (of multiple games) as Akababa's does, my suggestion may not be useful. But this may be useful for @gooooloo.

Thanks @AranKomat. I didn't see this post until just now...

I guess you can improve your performance with more self-plays per update

Yes, I also think so. DeepMind uses 2000+ or 4000+ TPUs for self-play (as Aja Huang says in a post; I just can't remember the link). This shows that self-play performance is important.

Maybe the performance gain by increasing the sims/move from 100 to 800 was because you had a small self-play/training ratio, that is, you had too little exploration.

Actually, I was getting a smaller self-play/training ratio when increasing sims/move from 100 to 800. Although I also introduced a multi-process implementation at that time, the overall self-play game speed was a little slower than before. Yet I observed an improvement in AI strength.

AranKomat commented 6 years ago

@gooooloo In AlphaZero, a staggering 5000 TPUs were used, so I totally agree. It's weird but nice that increased sims/move resulted in a smaller ratio. Hopefully, @mokemokechicken and others will observe a similar phenomenon.

mokemokechicken commented 6 years ago

400 positions per 1 self-play game

Note: I used (nb_game_in_file, max_file_num)=(5, 300), so the number of total games in training data was 1500 (games). My training dataset size was about 600k (positions). So, 600k / 1500 = 400 (position/game).

gooooloo commented 6 years ago

I used (nb_game_in_file, max_file_num)=(5, 300), so the number of total games in training data was 1500 (games). My training dataset size was about 600k (positions). So, 600k / 1500 = 400 (position/game).

But a Reversi game has at most 60 positions to move in, doesn't it? Even with up to 5 "PASS" moves, that is 65. Then even with game state flips and rotations, it is at most 260.

UPDATE: Oh, my mistake, "flip and rotation" gives a x8 multiplication, not x4. Then it makes sense. 400/8=50, so you are playing 50 moves per game, given that you have a resignation mechanism.
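To illustrate why flips and rotations give a factor of 8, here is a minimal sketch that enumerates the symmetries of a square board stored as a list of rows. The helper names are hypothetical and this is not the repository's augmentation code.

```python
def rotate90(board):
    """Rotate a square board (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*board[::-1])]

def flip(board):
    """Mirror the board left to right."""
    return [row[::-1] for row in board]

def board_symmetries(board):
    """Return the 8 variants: 4 rotations, each with and without a flip."""
    variants = []
    current = [list(row) for row in board]
    for _ in range(4):
        variants.append(current)
        variants.append(flip(current))
        current = rotate90(current)
    return variants

board = [[8 * r + c for c in range(8)] for r in range(8)]  # distinct cell labels
syms = board_symmetries(board)
print(len(syms))                   # prints 8
print(600_000 / 1500 / len(syms))  # prints 50.0, the moves per game from the thread
```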

gooooloo commented 6 years ago

... had a small self-play/training ratio

It's weird ... that increased sims/move resulted in a smaller ratio

The ratio is # of self-play moves / # of trained moves. I increased the # of sims per move, so self-play got slower and the # of self-play moves got smaller. But the training module did not change. So the overall ratio got smaller. Right?

AranKomat commented 6 years ago

@gooooloo Sorry, I thought you were talking about the training/self-play ratio, but it was the opposite. My mistake. I also agree with you about the number of positions per game.

gooooloo commented 6 years ago

@AranKomat I made a calculation mistake. Please see that post again; I have modified it.

AranKomat commented 6 years ago

@gooooloo Well, that makes sense. But when I said 150 stones in an average Go game, I didn't take the symmetries into account, so for a fair comparison I didn't consider the symmetries of Reversi, which has the same set of symmetries as Go. Sorry for not being explicit. Since what we're concerned with is the ratio between our training/self-play ratio (5.3 after symmetries) and AZ's training/self-play ratio (about 0.44, but 0.44/8=0.055 after symmetries), there's still a roughly 100-fold difference, which is reasonable given the number of GPUs we're using.

mokemokechicken commented 6 years ago

It is strange for the training/self-play ratio to be under 1. That would mean that some positions are never used in training. So, I think the ratio was almost 1.

AranKomat commented 6 years ago

The ratio of 0.44 was obtained from AlphaZero, where symmetry wasn't exploited. Also, Shogi and Chess cannot exploit symmetries, so they set the self-play vs training ratio of AlphaZero based on the assumption that self-play data isn't necessarily as plentiful as in symmetric games. Without symmetry, the ratio is 0.44, which is closer to 1. The ratio for Shogi and Chess may be even closer to 1. Also, in symmetric games without symmetric data augmentation, the NN quickly learns symmetry, which was demonstrated by AZ being superior to AGZ in Go. Considering the eventual meaninglessness of symmetric data augmentation, the net ratio of @gooooloo becomes 5.3*8=42.4. So, he needs at least 42 times more GPUs for self-play to get to 1.
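A small check, using only numbers quoted in this thread, suggests that the gap to AlphaZero is about the same whichever way the symmetry factor is handled, as long as both sides are treated alike. The variable names are hypothetical.

```python
SYMMETRIES = 8  # flips and rotations of a square board

ours_with_sym = 5.3    # training/self-play ratio quoted in the thread (augmented data)
az_without_sym = 0.44  # AlphaZero estimate quoted in the thread (no augmentation)

# View 1: count the symmetric copies on both sides.
gap_with_sym = ours_with_sym / (az_without_sym / SYMMETRIES)

# View 2: count them on neither side (5.3 * 8 = 42.4 versus 0.44).
gap_without_sym = (ours_with_sym * SYMMETRIES) / az_without_sym

print(round(gap_with_sym), round(gap_without_sym))  # prints 96 96
```

Either way the difference is close to the "100 times" mentioned earlier; the factor of 42 in the comment above measures the distance to a ratio of 1 rather than to AlphaZero's 0.44.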

gooooloo commented 6 years ago

@AranKomat @mokemokechicken I double-checked my pipeline's performance: it should be 25 processes at 180 seconds per game per process, which gives about 7 seconds per game on average. So my ratio should be about 7.5 (=426/(400/7)), not 5.3.
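The corrected figures can be rechecked as follows. This is a sketch that assumes the 25 processes run fully in parallel and that 426 is the training speed in positions per second; the function name is hypothetical.

```python
def effective_seconds_per_game(seconds_per_game_per_process, num_processes):
    """Average wall-clock time per finished game across parallel processes."""
    return seconds_per_game_per_process / num_processes

seconds_per_game = effective_seconds_per_game(180, 25)
selfplay_positions_per_sec = 400 / seconds_per_game  # 400 positions per game
ratio = 426 / selfplay_positions_per_sec             # assumed training speed: 426 positions/s

print(f"{seconds_per_game:.1f} s/game, ratio {ratio:.2f}")  # prints 7.2 s/game, ratio 7.67
```

With the unrounded 7.2 seconds per game the ratio is about 7.7; rounding to 7 seconds first, as in the comment, gives about 7.5.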