mokemokechicken / reversi-alpha-zero

Reversi reinforcement learning by AlphaGo Zero methods.
MIT License

Baseline Comparison? #26

Open mrlooi opened 6 years ago

mrlooi commented 6 years ago

Is there a baseline for comparing the learned model e.g. a benchmark software to evaluate against? It would be useful for us to know how effective the learning algorithm actually is.

For example, what do you mean by "Won the App LV x?" Does it mean that if the model beat the app even once, it counts as a win even if it loses the other times?

I downloaded your "best model" and "newest model", and played both networks against grhino AI (level 2). Sadly, both networks got destroyed by grhino on multiple tries. If you have a benchmark of levels to beat before grhino, that would be really helpful

evalon32 commented 6 years ago

In README, it says the "App" is this: https://itunes.apple.com/ca/app/id574915961. I'm not familiar with it and don't have an iOS device, but I'm guessing it's not that strong. For what it's worth, I've also been testing the networks against grhino, with similar results. I've had RAZ beat grhino L2 once, but only because it got lucky. That said, I think it's a good sign that RAZ can now tell that its position gradually deteriorates (the evaluation goes relatively smoothly from 0 to -1 over the course of the game). Earlier, it often had no idea. It also used to lose consistently to grhino L1; now it usually wins (sadly, it's usually because grhino L1 blunders in a won position).

mokemokechicken commented 6 years ago

Hi @vincentlooi

Is there a baseline for comparing the learned model e.g. a benchmark software to evaluate against?

I use the iOS app https://itunes.apple.com/ca/app/id574915961 as the benchmark. The app has levels 1~99.

For example, what do you mean by "Won the App LV x?" Does it mean that if the model beat the app even once, it counts as a win even if it loses the other times?

Yes. "Won the App LV x?" means the model won the level at least once (regardless of the number of losses).

I downloaded your "best model" and "newest model", and played both networks against grhino AI (level 2). Sadly, both networks got destroyed by grhino on multiple tries. If you have a benchmark of levels to beat before grhino, that would be really helpful

I didn't know about grhino, and I confirmed that the newest model loses to grhino Lv2...

mokemokechicken commented 6 years ago

Hi @evalon32

In README, it says the "App" is this: https://itunes.apple.com/ca/app/id574915961. I'm not familiar with it and don't have an iOS device, but I'm guessing it's not that strong.

The app has levels 1~99. Maybe Lv29 is not so strong.

For what it's worth, I've also been testing the networks against grhino, with similar results. I've had RAZ beat grhino L2 once, but only because it got lucky.

Could you tell me, what is RAZ? (I couldn't find it on Google...)

That said, I think it's a good sign that RAZ can now tell that its position gradually deteriorates (the evaluation goes relatively smoothly from 0 to -1 over the course of the game)

I also think that is a good feature. In my newest model, the evaluation often plummets.

evalon32 commented 6 years ago

Could you tell me, what is RAZ? (I couldn't find it on Google...)

Oh sorry, RAZ = reversi-alpha-zero :)

mokemokechicken commented 6 years ago

RAZ = reversi-alpha-zero :)

Oh, I see! (^^

mokemokechicken commented 6 years ago

FYI:

evalon32 commented 6 years ago

I just had the newest model play a match of 10 games vs grhino L2 (took forever, since I don't have a GPU). It won 2 out of 5 as black and 2 out of 5 as white. Getting exciting!

mokemokechicken commented 6 years ago

That's good!

took forever, since I don't have a GPU

FYI: I am also evaluating on a Mac (no GPU); an optimized build of TensorFlow (1.4) is about 3~5 times faster than the normal pip CPU version. https://www.tensorflow.org/install/install_sources

mrlooi commented 6 years ago

I managed to make some progress in training the model. I played the model against grhino lv2 5 times: 4 wins, 1 loss. Still lost vs grhino lv3 though. I also played the model against the newest/best model in your download script, and had a win rate of ~85% over roughly 25 games.

I managed to train this model from scratch over the course of a week (on a 1080 GPU), by manually and repeatedly removing old data (anything older than 1-2 days) from the data/play_data folder while the model keeps self-playing.

The current training method in your script trains on all data in the folder regardless of when it was created, which means each training iteration takes longer as self-play generates more and more data. I'm not sure this is necessary: old data reflects an older policy rather than the newest one, so it could be redundant, cost extra training steps, and potentially cause overfitting. Perhaps it would be a good idea to weight the data by how recently it was played, i.e. how closely it reflects the latest policy, or to turn the data into a fixed-size buffer (perhaps 250k-300k samples) that discards old samples as new ones are generated, as sketched below.
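
A minimal sketch of that fixed-size buffer idea (hypothetical names, not this repo's actual data pipeline):

from collections import deque
import random

class ReplayBuffer:
    """Hypothetical fixed-size self-play buffer, capped at ~300k samples."""
    def __init__(self, capacity=300_000):
        self.samples = deque(maxlen=capacity)   # (state, policy, value) tuples

    def add_game(self, game_samples):
        # appending past capacity silently evicts the oldest samples
        self.samples.extend(game_samples)

    def sample_batch(self, batch_size):
        indices = random.sample(range(len(self.samples)), batch_size)
        return [self.samples[i] for i in indices]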

EDIT: Just beat grhino lv3! The model now beats grhino lv2 almost every time, getting exciting

mokemokechicken commented 6 years ago

@vincentlooi

Thank you for sharing exciting information!

EDIT: Just beat grhino lv3! The model now beats grhino lv2 almost every time, getting exciting

That's great!!

I managed to train this model from scratch over the course of a week (on a 1080 GPU), by manually and repeatedly removing old data (anything older than 1-2 days) from the data/play_data folder while the model keeps self-playing.

Nice experiment! I also think it is one of the important hyperparameters. The maximum number of training samples can be changed via PlayDataConfig#{nb_game_in_file,max_file_num} (used here).

I will change this parameter in my training. In my environment, self-play generates about 100 training data files per day (500 games/day), so it seems better to set max_file_num to around 300 (currently 2000).
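
As a rough back-of-the-envelope check (assuming about 5 games per file, implied by the ~100 files/day vs. 500 games/day figures, and at most 60 positions per game):

# Rough size of the training window under the proposed setting (assumed numbers):
nb_game_in_file = 5        # 500 games/day spread over ~100 files/day
max_file_num = 300         # proposed window, i.e. roughly 3 days of self-play
positions_per_game = 60    # a reversi game has at most 60 moves

print(nb_game_in_file * max_file_num * positions_per_game)  # 90000 positions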

apollo-time commented 6 years ago

What is the best Reversi program to test against? I don't have an iPhone, but I do have a Mac. My model beats all of the Android Reversi and Windows Reversi apps.

mokemokechicken commented 6 years ago

@apollo-time

I use GRhino by docker on mac. FYI: https://github.com/mokemokechicken/grhino-docker

gooooloo commented 6 years ago

@mokemokechicken @vincentlooi @evalon32 When playing with GRhino, besides the "level" setting, what is your "open book variation" setting? I am playing my model against GRhino on Ubuntu and want to do an (indirect) comparison with yours. Thanks.

apollo-time commented 6 years ago

My model (black) now beats GRhino Lv5 with open book variation "Low" and randomness 0. I made a web player in HTML, but I don't have a server to run the TensorFlow model.

mokemokechicken commented 6 years ago

@gooooloo My open book variation is "Low".

gooooloo commented 6 years ago

@mokemokechicken gotcha. Thanks.

apollo-time commented 6 years ago

I see "Online Reversi" on the Microsoft Store is very excellent. My model beats level 2 hardly. (2018/01/10) My model beats level 3 hardly now. (2018/01/11)

gooooloo commented 6 years ago

Hi everyone, I found that http://www.orbanova.com/nboard/ is very strong. It also supports many levels to play at, so it would be a good baseline to compare against.

mokemokechicken commented 6 years ago

@gooooloo it's great! Thank you very much!

mokemokechicken commented 6 years ago

I implemented the NBoard protocol.

gooooloo commented 6 years ago

@mokemokechicken Just a report: my model beats Lv99 using 800 simulations per move. See https://play.lobi.co/video/17f52b6e921be174057239d39d239b6061d3c1c9. The AlphaGo Zero method works. I am also using 800 simulations per move during self-play. I keep the evaluator alive, with the best-model replacement condition being an Elo rating of >= 150 over 400 games (draws are counted in the Elo rating). I am using 2 historical boards as neural network input, which means a shape of 5x8x8.

Besides, when playing against the App, I found that 40 or 100 simulations per move is already quite strong. The 100-sims setting beats Lv98 easily. But Lv99 is harder than Lv98: I tested 40/100/400 sims and all of them lost, until I changed to 800 sims.
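
For reference, here is my guess at how such a 5x8x8 input could be assembled (own/opponent planes at t and t-1 plus a side-to-move plane; gooooloo's exact layout may differ):

import numpy as np

def build_input(own, enemy, own_prev, enemy_prev, black_to_move):
    """Stack 5 binary 8x8 planes: own/enemy stones at t and t-1, plus side to move.
    For the first move (the paper's t < 0 case) the previous planes are all zeros."""
    turn_plane = np.full((8, 8), 1.0 if black_to_move else 0.0)
    return np.stack([own, enemy, own_prev, enemy_prev, turn_plane]).astype(np.float32)

# Example: the opening position, with no history yet.
own = np.zeros((8, 8));   own[3, 4] = own[4, 3] = 1      # black stones
enemy = np.zeros((8, 8)); enemy[3, 3] = enemy[4, 4] = 1  # white stones
x = build_input(own, enemy, np.zeros((8, 8)), np.zeros((8, 8)), black_to_move=True)
print(x.shape)  # (5, 8, 8)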

mokemokechicken commented 6 years ago

@gooooloo

Great! Congratulation!!

I am surprised by this report!

800 simulations per move

After all, in order to become strong, it may be necessary to use a large "simulations per move" in self-play, right? I feel that "simulations per move" sets an upper bound on the model's strength.

2 historical boards as neural network input, which means a shape of 5x8x8

It is very interesting. Why do you use history? Do you think it brought good effects?

gooooloo commented 6 years ago

@mokemokechicken It is partly because of your great implementation. So thank you :)

After all, in order to become strong, it may be necessary to use a large "simulations per move" in self-play, right?

I also think so. At first, I was using 100 sims per move because I wanted fast self-play. After about 100k steps (batch_size = 3072), it seemed to get stuck and stopped improving. Then I changed to 800 sims. At about 200k steps, it had become quite strong. My final model, the one that beats Lv99, is at 300k+ steps.

I think it is also worth mentioning that, although I changed to 800 sims, I didn't make overall self-play too much slower. I did this by separating MCTS and the neural network into different processes that communicate via named pipes. Then I can run several MCTS processes and only 1 neural network process at the same time. This idea is borrowed from this repo (thanks @Akababa). By doing this, I make full use of the GPU and CPU. Although a single game gets slower due to 800 sims, multi-game parallelization wins a lot back. ---- I mention this because I think self-play speed really matters in the AlphaGo Zero method.
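
A minimal sketch of that pattern (not gooooloo's actual code; it uses multiprocessing.Pipe rather than OS named pipes, and dummy_predict stands in for the real policy/value network):

import numpy as np
from multiprocessing import Pipe, Process

def dummy_predict(batch):
    """Placeholder for the network: uniform policy over 8*8+1 moves, zero value."""
    n = batch.shape[0]
    return np.full((n, 65), 1.0 / 65), np.zeros((n, 1))

def prediction_server(conns):
    """Gather pending requests from all workers and answer them with one batched call."""
    while True:
        ready = [c for c in conns if c.poll(0.001)]   # workers with a pending position
        if not ready:
            continue
        boards = [c.recv() for c in ready]            # one position per worker
        policies, values = dummy_predict(np.stack(boards))
        for c, p, v in zip(ready, policies, values):
            c.send((p, v))

def selfplay_worker(conn, n_positions=8):
    """Stub MCTS worker: asks the server to evaluate a few random positions."""
    for _ in range(n_positions):
        board = np.random.rand(5, 8, 8).astype(np.float32)   # fake 5x8x8 input
        conn.send(board)
        policy, value = conn.recv()                          # blocks until the server replies
        # a real worker would use policy/value to expand its search tree here

if __name__ == "__main__":
    worker_ends, server_ends = zip(*[Pipe() for _ in range(4)])   # 4 self-play workers
    Process(target=prediction_server, args=(list(server_ends),), daemon=True).start()
    workers = [Process(target=selfplay_worker, args=(c,)) for c in worker_ends]
    for w in workers:
        w.start()
    for w in workers:
        w.join()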

2 historical boards as neural network input, which means a shape of 5x8x8

Why do you use history?

Because I happened to see this reddit post from David Silver @ DeepMind. This is the quote:

it is useful to have some history to have an idea of where the opponent played recently - these can act as a kind of attention mechanism (i.e. focus on where my opponent thinks is important)

I have used this representation from the beginning and haven't tested the 3x8x8 shape, so I can't say from experience. But I believe it gives the network a chance at "attention" (by subtracting the previous board). Maybe it helps.
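
A tiny illustration of that subtraction idea (a toy position, not code from this repo): with the previous planes available, the network can in principle locate the opponent's last move.

import numpy as np

own_prev = np.zeros((8, 8));   own_prev[3, 3] = 1            # my disc before their move
enemy_prev = np.zeros((8, 8)); enemy_prev[4, 3] = 1          # opponent's disc before
enemy_now = np.zeros((8, 8));  enemy_now[[2, 3, 4], 3] = 1   # they play (2, 3), flipping (3, 3)

# Newly-enemy squares include flipped discs, so also require "was empty before":
placed = (enemy_now == 1) & (enemy_prev == 0) & (own_prev == 0)
print(np.argwhere(placed))  # [[2 3]] -- the opponent's last move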

Lastly, I am using 6 GPUs: 5 Tesla P40s (1 for optimization, 4 for self-play) + 1 Tesla M40 (for the evaluator). Maybe it is mostly because of the compute power...

mokemokechicken commented 6 years ago

@gooooloo

Thank you for your reply.

I think it is also worth mentioning that, although I changed to 800 sims, I didn't make overall self-play too much slower. I did this by separating MCTS and the neural network into different processes that communicate via named pipes.

Great. I think it is the best implementation.

I believe it gives the network a chance at "attention" (by subtracting the previous board). Maybe it helps.

I see. I hadn't thought of that possibility. It is very interesting.

Lastly, I am using 6 GPUs: 5 Tesla P40s (1 for optimization, 4 for self-play) + 1 Tesla M40 (for the evaluator). Maybe it is mostly because of the compute power...

That's very powerful!! :)

apollo-time commented 6 years ago

@gooooloo Um... is history really useful? When using history, the player cannot play from just a single board state. Some games, like chess, sometimes have to be played from a board state that is not the initial state.

gooooloo commented 6 years ago

@apollo-time Do you mean the first move of the game? As the AlphaGo Zero paper mentions, all-zero boards are used when there are not enough history boards.

8 feature planes X_t consist of binary values indicating the presence of the current player's stones (X_t^i = 1 if intersection i contains a stone of the player's colour at time-step t; 0 if the intersection is empty, contains an opponent stone, or if t < 0)

"t < 0" is the case here.

apollo-time commented 6 years ago

@gooooloo No, I mean that some games can be played from a board state that is not the initial state, for example chess puzzles.

apollo-time commented 6 years ago

@gooooloo Can you beat the Windows Online Reversi game at level 5?

gooooloo commented 6 years ago

@apollo-time

No, I mean that some games can be played from a board state that is not the initial state, for example chess puzzles.

I see. I haven't considered that case.

Can you beat the Windows Online Reversi game at level 5?

I don't have a Windows system (I will try to find one). But I can't beat NBoard's Novello at level 20 (I can beat level 10, though, with 1600 sims per move), nor NTest at level 30.

apollo-time commented 6 years ago

@gooooloo Thanks. My question is the same as Cassandra120's.

gooooloo commented 6 years ago

@apollo-time

Can you beat the Windows Online Reversi game at level 5?

I just played it, using the same model and simulations_per_move (800) as in the Lv99 game, and I beat Online Reversi level 5 (2:0) but lose to level 6 (1:3).

apollo-time commented 6 years ago

@gooooloo My model (simulations_per_move=800) beats Online Reversi level 4 now, and my model doesn't use history. But do you feel your model keeps improving continuously?

gooooloo commented 6 years ago

@apollo-time I got a new-generation model the day before yesterday, but haven't gotten any better model in the last two days. Let's wait a few more days and see.

AranKomat commented 6 years ago

@gooooloo

After about 100k steps (batch_size = 3072), it seemed to get stuck and stopped improving.

That's also the case in AlphaZero: performance more or less stagnated after that point. But they achieved already strong performance (with a different board game) at 100k iters not only due to using 800 sims/move but also due to their large architecture and large buffer. Also, they did one iteration of update for each 30 or so games (about 3M games after 100k iters), which may not be the case in the implementations of @mokemokechicken, Zeta36 and Akababa.

How about your case? Did you use the "normal" config setting instead of "mini"?

gooooloo commented 6 years ago

@AranKomat

... due to their large architecture ...

my config (the network architecture is the same as @mokemokechicken's original implementation):

class ModelConfig:
    cnn_filter_num = 256
    cnn_filter_size = 3
    res_layer_num = 10
    l2_reg = 1e-4
    value_fc_size = 256
    input_size = (5,8,8) 
    policy_size = 8*8+1

... and large buffer

mine:

class PlayDataConfig:
    def __init__(self):
        self.nb_game_in_file = 50
        self.max_file_num = 1000

class TrainerConfig:
    def __init__(self):
        self.batch_size = 3072
        self.epoch_to_checkpoint = 1
        self.epoch_steps = 100
        self.save_model_steps = 800
        self.lr_schedule = (
            (0.2,    1500),  # means being 0.2 until 1500 steps.
            (0.02,   20000),
            (0.002,  100000),
            (0.0002, 9999999999)
        )

I also changed the sampling method. I did this because I found that in my case (with much more play data), @mokemokechicken's original implementation takes too long: it waits for all loaded data to be trained on at least once before new play data is loaded and before a new candidate model is generated.

    def generate_train_data(self, batch_size):
        # randint comes from the random module; data_size is the number of
        # currently loaded training samples.
        while True:
            x = []

            for _ in range(batch_size):
                n = randint(0, data_size - 1)
                # sample the n-th data point and append it to x

            yield x

    def train_epoch(self, epochs):
        tc = self.config.trainer
        self.model.model.fit_generator(generator=self.generate_train_data(tc.batch_size),
                                       steps_per_epoch=tc.epoch_steps,
                                       epochs=epochs)
        return tc.epoch_steps * epochs

    def training(self):
        last_save_step = 0
        while True:
            self.update_learning_rate()
            steps = self.train_epoch(self.config.trainer.epoch_to_checkpoint)
            self.total_steps += steps

            # save a candidate model for evaluation every save_model_steps steps
            if last_save_step + self.config.trainer.save_model_steps <= self.total_steps:
                self.save_current_model_as_to_eval()
                last_save_step = self.total_steps

            self.load_play_data()
So basically, I am using the "normal" config but changed a lot of things. The other configs are listed below if you are interested:

class PlayConfig:
    def __init__(self):
        self.simulation_num_per_move = 800
        self.c_puct = 5
        self.noise_eps = 0.25
        self.dirichlet_alpha = 0.4
        self.change_tau_turn = 10
        self.virtual_loss = 3
        self.prediction_queue_size = 8
        self.parallel_search_num = 8
        self.v_resign_check_min_n = 100
        self.v_resign_init = -0.9
        self.v_resign_delta = 0.01
        self.v_resign_disable_prop = 0.1
        self.v_resign_false_positive_fraction_t_max = 0.05
        self.v_resign_false_positive_fraction_t_min = 0.04

AranKomat commented 6 years ago

@gooooloo Thanks so much for the detailed information. It looks like you don't have self.search_threads for multi-threading. Did you find multiprocessing alone to be sufficient? It's impressive that your sampling method enabled you to finish 200k iters with your large architecture; it looks like Akababa's multiprocessing approach is very powerful. But I couldn't tell how many self-play games you finished up to 100~200k iters. Have you tracked the number of games?

mokemokechicken commented 6 years ago

@gooooloo @apollo-time @evalon32 @vincentlooi @AranKomat

I created Performance Reports for sharing our achievements and linked it from the top of the README. I would be grateful if you would post there.

gooooloo commented 6 years ago

@AranKomat

Have you tracked the number of games?

No I have not. I wish I had.

gooooloo commented 6 years ago

Hi everyone, the code I used to get my model is here: https://github.com/gooooloo/reversi-alpha-zero, if you are interested.