initial-h / AlphaZero_Gomoku_MPI

An asynchronous/parallel implementation of the AlphaGo Zero algorithm for Gomoku

Training on a 15*15 board #17

Closed · Nkust-R105 closed this 4 years ago

Nkust-R105 commented 4 years ago

Sorry to bother you. My original board size was 9*9, but I have now changed the board to 15*15. How do I retrain it?

initial-h commented 4 years ago

Change the board size to 15*15:
https://github.com/initial-h/AlphaZero_Gomoku_MPI/blob/b1cc50ab59121ce1d03203c11a0665049abf83c6/train.py#L24-L25
Then, in TrainPipeline, fill in the path of your transfer model, for example on the commented-out line here:
https://github.com/initial-h/AlphaZero_Gomoku_MPI/blob/b1cc50ab59121ce1d03203c11a0665049abf83c6/train.py#L261-L262
Plus, if you run into the problem of the first move not being played in the center when training on the larger board, see issue #6.
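
For reference, here is a minimal sketch of what those two edits look like. The attribute and parameter names (board_width, board_height, transfer_model) and the model path below are assumptions based on this discussion, so check the linked lines in train.py for the exact names used by the repo:

# 1) board size, inside TrainPipeline in train.py (around the first linked lines)
self.board_width = 15    # e.g. was 9 before
self.board_height = 15

# 2) transfer model, where TrainPipeline is created (around the second linked lines):
#    point it at the weights you want to fine-tune from, e.g. the net trained
#    on the smaller board (path and keyword are only an example)
training_pipeline = TrainPipeline(transfer_model='model/best_policy.model')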

Nkust-R105 commented 4 years ago

So after I run train.py, it will automatically create two models for me (current_policy.model and best_policy.model), right?

initial-h commented 4 years ago

Yes, they are saved under the tmp and model folders respectively. I recommend training with multiple processes; a single process is very slow.
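
(Multi-process training in this repo goes through MPI, so it is launched with mpiexec rather than plain python, e.g. something along the lines of `mpiexec -np <num_processes> python -u train_mpi.py`; the script name and process count here are only an illustration, check the README for the exact command.)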

Nkust-R105 commented 4 years ago

Sorry, could I ask what these numbers mean?

batch i:100, episode_len:58
kl:0.02112,lr_multiplier:0.132,loss:4.146144390106201,entropy:4.015280723571777,explained_var_old:0.788,explained_var_new:0.812
current self-play batch: 100

Nkust-R105 commented 4 years ago

Could you tell me what this piece of code is doing?

def policy_update(self):
    '''
    update the policy-value net
    play_data: [(state, mcts_prob, winner_z), ..., ...]
    '''
    # train an epoch

    tmp_buffer = np.array(self.data_buffer)
    np.random.shuffle(tmp_buffer)
    steps = len(tmp_buffer)//self.batch_size
    print('tmp buffer: {}, steps: {}'.format(len(tmp_buffer),steps))
    for i in range(steps):
        mini_batch = tmp_buffer[i*self.batch_size:(i+1)*self.batch_size]
        state_batch = [data[0] for data in mini_batch]
        mcts_probs_batch = [data[1] for data in mini_batch]
        winner_batch = [data[2] for data in mini_batch]

        # policy and value outputs on this mini-batch before the gradient step
        old_probs, old_v = self.policy_value_net.policy_value(state_batch=state_batch,
                                                              actin_fc=self.policy_value_net.action_fc_test,
                                                              evaluation_fc=self.policy_value_net.evaluation_fc2_test)
        # one gradient step on the mini-batch
        loss, entropy = self.policy_value_net.train_step(state_batch,
                                                         mcts_probs_batch,
                                                         winner_batch,
                                                         self.learn_rate)
        # policy and value outputs after the gradient step, to measure how far it moved
        new_probs, new_v = self.policy_value_net.policy_value(state_batch=state_batch,
                                                              actin_fc=self.policy_value_net.action_fc_test,
                                                              evaluation_fc=self.policy_value_net.evaluation_fc2_test)
        # KL divergence between the old and new policy: how far this single
        # update step moved the network
        kl = np.mean(np.sum(old_probs * (
                np.log(old_probs + 1e-10) - np.log(new_probs + 1e-10)),
                axis=1)
        )

        # explained variance of the value head against the true outcomes,
        # before and after the update; the closer to 1, the better the fit
        explained_var_old = (1 -
                             np.var(np.array(winner_batch) - old_v.flatten()) /
                             np.var(np.array(winner_batch)))
        explained_var_new = (1 -
                             np.var(np.array(winner_batch) - new_v.flatten()) /
                             np.var(np.array(winner_batch)))

        if steps<10 or (i%(steps//10)==0):
            # print some information, not too much
            print('batch: {},length: {}'
                  'kl:{:.5f},'
                  'loss:{},'
                  'entropy:{},'
                  'explained_var_old:{:.3f},'
                  'explained_var_new:{:.3f}'.format(i,
                                                    len(mini_batch),
                                                    kl,
                                                    loss,
                                                    entropy,
                                                    explained_var_old,
                                                    explained_var_new))

    return loss, entropy
initial-h commented 4 years ago

batch i is the i-th self-play game collected, and episode_len is how many moves that game lasted.

kl is the computed KL divergence; lr_multiplier is a coefficient multiplied onto the learning rate (I don't think I use that parameter anymore); loss is the network's current training loss; entropy is the entropy of the predicted action distribution, used to monitor the training process; explained_var_old and explained_var_new measure the gap between the value predictions and the true outcomes before and after training, respectively (the closer to 1, the smaller the gap), also used for monitoring.

current self-play batch is just a counter used when evaluating the model (if the new model is better, it replaces the previous one); it is only printed so you can see which evaluation round this is. Maybe I shouldn't have called it "batch", the wording is unclear.

The code you pasted is the network update: it first shuffles the data, then draws one mini-batch after another to train on, and finally prints some metrics so you can monitor how well this round of training went.
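
To make those monitored quantities concrete, here is a small self-contained numpy sketch (not code from the repo; the numbers are made up) of how the kl, entropy and explained_var values in policy_update are computed:

import numpy as np

# three states, four legal moves each; every row is a probability distribution
old_probs = np.array([[0.40, 0.30, 0.20, 0.10],
                      [0.25, 0.25, 0.25, 0.25],
                      [0.70, 0.10, 0.10, 0.10]])   # policy before the update
new_probs = np.array([[0.50, 0.25, 0.15, 0.10],
                      [0.30, 0.30, 0.20, 0.20],
                      [0.75, 0.10, 0.10, 0.05]])   # policy after the update

winner_batch = np.array([1.0, -1.0, 1.0])          # true game outcomes z
old_v = np.array([0.3, -0.2, 0.1])                 # value head before the update
new_v = np.array([0.6, -0.5, 0.4])                 # value head after the update

# kl: how far a single update step moved the policy (a large value means the step was too big)
kl = np.mean(np.sum(old_probs * (np.log(old_probs + 1e-10)
                                 - np.log(new_probs + 1e-10)), axis=1))

# entropy: how spread out the predicted move distribution still is;
# it usually decreases slowly as the policy becomes more confident
entropy = -np.mean(np.sum(new_probs * np.log(new_probs + 1e-10), axis=1))

# explained variance: 1.0 means the value head predicts the outcomes perfectly,
# 0.0 means it is no better than always predicting the mean outcome
explained_var_old = 1 - np.var(winner_batch - old_v) / np.var(winner_batch)
explained_var_new = 1 - np.var(winner_batch - new_v) / np.var(winner_batch)

print(kl, entropy, explained_var_old, explained_var_new)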

massawuyh commented 3 years ago

batch i:48, episode_len:57 tmp buffer: 15192, steps: 29 batch: 0,length:512kl:0.00902,loss:4.8271,entropy:4.6336517333984375,explained_var_old:0.889,explained_var_new:0.923

Could you explain the difference between batch i and batch here? And what do length:512, tmp buffer: 15192, and steps: 29 mean?