thu-ml / tianshou

An elegant PyTorch deep reinforcement learning library.
https://tianshou.org
MIT License

How to do self-play correctly for tic-tac-toe? #381

Closed dzy1997 closed 2 years ago

dzy1997 commented 3 years ago

I am trying to adapt the tic-tac-toe example to train through self-play. I train an agent against a fixed agent (with the same network architecture) until the average win rate reaches 95%. Then I copy the weights from the trained agent to the fixed agent and repeat the process for 10 generations of evolution.

My first question is how to freeze an agent in a multi-agent setting. Currently I am using an optimizer with lr=0 on the agent I want to freeze as an ad hoc solution, but I am not sure this is the correct way, and there should be a solution that does not compute gradients at all. Is there an existing API I should use (that I don't know about)? Or should I formulate the fixed agent as part of the environment and train without the multi-agent setting?

My second problem is that training has no variance at all (the log at the end of each epoch says test_reward: 1.000000 ± 0.000000). Is it because DQN is deterministic at test time and therefore too easy to beat? What should I change to make meaningful training progress, as when training against a random policy?

Trinkle23897 commented 3 years ago

Or should I formulate the fixed agent as a part of the environment and train without multi-agent settings?

This is a good approach in my opinion. You can switch the network parameters through the trainer's train_fn according to the epoch.
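
For instance, a minimal sketch of doing the weight swap inside train_fn (the 5-epoch schedule and the use of the policies list are illustrative assumptions, not code from this thread):

def train_fn(epoch, env_step):
    # keep exploration noise on for both agents during training
    policy.policies[0].set_eps(args.eps_train)
    policy.policies[1].set_eps(args.eps_train)
    # hypothetical schedule: refresh the frozen opponent every 5 epochs by
    # copying the learner's current weights into it
    if epoch % 5 == 0:
        policy.policies[1].load_state_dict(policy.policies[0].state_dict())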

test_reward: 1.000000 ± 0.000000

Have you inspected the actual trajectories to see if they are reasonable? I'm not sure, but in my experience this kind of result is often a bug (e.g., forgetting to set eps).

dzy1997 commented 3 years ago

I have checked my code and set eps properly, but I am still getting no variance at all, and trying to watch the match gives a "result has no attribute 'act'" error in the collector. I appended the following code to tic_tac_toe.py:

def watch_selfplay(args, agent):
    env = TicTacToeEnv(args.board_size, args.win_size)
    agent.set_eps(args.eps_test)
    policy = MultiAgentPolicyManager([agent, agent])
    policy.eval()
    collector = Collector(policy, env)
    result = collector.collect(n_episode=1, render=args.render)
    rews, lens = result["rews"], result["lens"]
    print(f"Final reward: {rews[:, 0].mean()}, length: {lens.mean()}")

def selfplay(args, num_generation=5): # always train first agent, start from random policy
    def env_func():
        return TicTacToeEnv(args.board_size, args.win_size)
    train_envs = DummyVectorEnv([env_func for _ in range(args.training_num)])
    test_envs = DummyVectorEnv([env_func for _ in range(args.test_num)])
    # seed
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    train_envs.seed(args.seed)
    test_envs.seed(args.seed)

    # model
    env = TicTacToeEnv(args.board_size, args.win_size)
    args.state_shape = env.observation_space.shape or env.observation_space.n
    args.action_shape = env.action_space.shape or env.action_space.n
    net = Net(args.state_shape, args.action_shape,
            hidden_sizes=args.hidden_sizes, device=args.device
            ).to(args.device)
    optim = torch.optim.Adam(net.parameters(), lr=args.lr)
    agent_learn = DQNPolicy(
        net, optim, args.gamma, args.n_step,
        target_update_freq=args.target_update_freq)

    net_fixed = Net(args.state_shape, args.action_shape,
            hidden_sizes=args.hidden_sizes, device=args.device
            ).to(args.device)
    optim_fixed = torch.optim.SGD(net_fixed.parameters(), lr=0)
    agent_fixed = DQNPolicy(
        net_fixed, optim_fixed, args.gamma, args.n_step,
        target_update_freq=args.target_update_freq)
    # path = os.path.join(args.logdir, 'tic_tac_toe', 'dqn', 'policy.pth')
    # agent_fixed.load_state_dict(torch.load(path))

    # initialize agents and ma-policy
    agents = [agent_learn, agent_fixed]
    policy = MultiAgentPolicyManager(agents)

    # collector
    train_collector = Collector(
        policy, train_envs,
        VectorReplayBuffer(args.buffer_size, len(train_envs)),
        exploration_noise=True)
    test_collector = Collector(policy, test_envs)
    # policy.set_eps(1)
    train_collector.collect(n_step=args.batch_size * args.training_num)
    # log
    log_path = os.path.join(args.logdir, 'tic_tac_toe', 'dqn')
    writer = SummaryWriter(log_path)
    writer.add_text("args", str(args))
    logger = BasicLogger(writer)

    def save_fn(policy):
        pass

    def stop_fn(mean_rewards):
        return mean_rewards >= args.win_rate

    def train_fn(epoch, env_step):
        policy.policies[0].set_eps(args.eps_train)
        policy.policies[1].set_eps(args.eps_train)

    def test_fn(epoch, env_step):
        policy.policies[0].set_eps(args.eps_test)
        policy.policies[1].set_eps(args.eps_test)

    def reward_metric(rews):
        return rews[:, 0]

    # trainer
    for i_gen in range(num_generation):
        result = offpolicy_trainer(
            policy, train_collector, test_collector, args.epoch,
            args.step_per_epoch, args.step_per_collect, args.test_num,
            args.batch_size, train_fn=train_fn, test_fn=test_fn,
            stop_fn=stop_fn, save_fn=save_fn, update_per_step=args.update_per_step,
            logger=logger, test_in_train=False, reward_metric=reward_metric)
        policy.policies[1].load_state_dict(policy.policies[0].state_dict())
        print('==={} Generations Evolved==='.format(i_gen+1))

    model_save_path = os.path.join(args.logdir, 'tic_tac_toe', 'dqn', 'policy_selfplay.pth')
    torch.save(policy.policies[0].state_dict(), model_save_path)

    return result, policy.policies[0]

and edited the test_tic_tac_toe.py to

import pprint
from tic_tac_toe import get_args, train_agent, watch, selfplay, watch_selfplay

def test_tic_tac_toe(args=get_args()):
    if args.watch:
        watch(args)
        return

    # result, agent = train_agent(args)
    result, agent = selfplay(args)
    assert result["best_reward"] >= args.win_rate

    if __name__ == '__main__':
        pprint.pprint(result)
        # Let's watch its performance!
        # watch(args, agent)
        watch_selfplay(args, agent)

if __name__ == '__main__':
    test_tic_tac_toe(get_args())

dzy1997 commented 3 years ago

I am trying to convert TicTacToeEnv into a single-agent environment (where the opponent is a random agent and part of the environment). Does that mean I need to inherit from gym.Env directly? Since gym.Env does not have action masking, how should I change the code so that I can still use the mask, with minimal changes, to fit test/discrete/test_dqn.py?

Trinkle23897 commented 3 years ago

Thanks for posting the code! I'll take a look this weekend. Btw, it seems I cannot run watch_selfplay correctly:

Traceback (most recent call last):
  File "test_tic_tac_toe.py", line 22, in <module>
    test_tic_tac_toe(get_args())
  File "test_tic_tac_toe.py", line 18, in test_tic_tac_toe
    watch_selfplay(args, agent)
  File "/home/trinkle/github/tianshou-new/test/multiagent/tic_tac_toe.py", line 195, in watch_selfplay
    result = collector.collect(n_episode=1, render=args.render)
  File "/home/trinkle/github/tianshou-new/tianshou/data/collector.py", line 218, in collect
    act = to_numpy(result.act)
  File "/home/trinkle/github/tianshou-new/tianshou/data/batch.py", line 202, in __getattr__
    return getattr(self.__dict__, key)
AttributeError: 'dict' object has no attribute 'act'

And my log is:

Epoch #1: 5001it [00:05, 846.94it/s, agent_1/loss=0.037, agent_2/loss=0.140, env_step=5000, len=7, n/ep=1, n/st=10, rew=1.00]                                                                                       
Epoch #1: test_reward: 1.000000 ± 0.000000, best_reward: 1.000000 ± 0.000000 in #0
===1 Generations Evolved===
/home/trinkle/github/tianshou-new/tianshou/trainer/offpolicy.py:87: UserWarning: Please consider using save_checkpoint_fn instead of save_fn.
  warnings.warn("Please consider using save_checkpoint_fn instead of save_fn.")
Epoch #1: 5001it [00:05, 970.61it/s, agent_1/loss=0.045, agent_2/loss=0.227, env_step=5000, len=13, n/ep=0, n/st=10, rew=0.00]                                                                                      
Epoch #1: test_reward: 1.000000 ± 0.000000, best_reward: 1.000000 ± 0.000000 in #0
===2 Generations Evolved===
Epoch #1: 5001it [00:05, 999.71it/s, agent_1/loss=0.049, agent_2/loss=0.221, env_step=5000, len=11, n/ep=0, n/st=10, rew=1.00]                                                                                      
Epoch #1: test_reward: 1.000000 ± 0.000000, best_reward: 1.000000 ± 0.000000 in #0
===3 Generations Evolved===
Epoch #1: 5001it [00:04, 1040.78it/s, agent_1/loss=0.050, agent_2/loss=0.234, env_step=5000, len=12, n/ep=2, n/st=10, rew=1.00]                                                                                     
Epoch #1: test_reward: -1.000000 ± 0.000000, best_reward: 1.000000 ± 0.000000 in #0
===4 Generations Evolved===
Epoch #1: 5001it [00:04, 1056.89it/s, agent_1/loss=0.056, agent_2/loss=0.182, env_step=5000, len=21, n/ep=0, n/st=10, rew=0.00]                                                                                     
Epoch #1: test_reward: -1.000000 ± 0.000000, best_reward: 1.000000 ± 0.000000 in #0
===5 Generations Evolved===

[1, 1, 1, -1, -1]

Does that mean I need to inherit the class from gym.Env directly? Since gym.Env does not have action masking, how should I change the code so I can still use the mask with minimum changes to fit in the test/discrete/test_dqn.py?

Yes, inherit from gym.Env, and change the env observation to

{
  "obs": obs,
  "mask": mask,
}

That should work.
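For illustration, here is a minimal, self-contained sketch of such a single-agent env; the board logic, reward scheme, and random opponent are my own assumptions rather than code from the repo, and the observations follow the {"obs": ..., "mask": ...} convention above:

import gym
import numpy as np
from gym import spaces


class SingleAgentTicTacToe(gym.Env):
    """3x3 tic-tac-toe vs. a uniformly random opponent.

    Observations are dicts of the form {"obs": board, "mask": legal_moves};
    observation_space only describes the board itself.
    """

    def __init__(self):
        self.observation_space = spaces.Box(-1, 1, shape=(9,), dtype=np.int8)
        self.action_space = spaces.Discrete(9)

    def reset(self):
        self.board = np.zeros(9, dtype=np.int8)
        return self._obs()

    def _obs(self):
        return {"obs": self.board.copy(), "mask": self.board == 0}

    def _wins(self, player):
        b = (self.board.reshape(3, 3) == player)
        return bool(b.all(0).any() or b.all(1).any()
                    or b.diagonal().all() or np.fliplr(b).diagonal().all())

    def step(self, action):
        self.board[action] = 1  # the learning agent always plays +1
        if self._wins(1):
            return self._obs(), 1.0, True, {}
        empty = np.flatnonzero(self.board == 0)
        if len(empty) == 0:  # draw
            return self._obs(), 0.0, True, {}
        self.board[np.random.choice(empty)] = -1  # random opponent move
        if self._wins(-1):
            return self._obs(), -1.0, True, {}
        return self._obs(), 0.0, not (self.board == 0).any(), {}

This is the same dict-observation format the tic-tac-toe example already uses for masking, so the rest of test/discrete/test_dqn.py should need few changes.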

dzy1997 commented 3 years ago

Your log is exactly what I got with the code I posted. I just fixed the problem with watch_selfplay: I need to deep-copy the agent when constructing the multi-agent policy. The function watch_selfplay looks like this now and it works:

from copy import deepcopy  # make sure this import exists in tic_tac_toe.py

def watch_selfplay(args, agent):
    env = TicTacToeEnv(args.board_size, args.win_size)
    agent.set_eps(args.eps_test)
    policy = MultiAgentPolicyManager([agent, deepcopy(agent)]) # fixed here
    policy.eval()
    collector = Collector(policy, env)
    result = collector.collect(n_episode=1, render=args.render)
    rews, lens = result["rews"], result["lens"]
    print(f"Final reward: {rews[:, 0].mean()}, length: {lens.mean()}")

But I am still getting rewards with no variance. After some debugging, it turned out that the 10 training environments produce different actions during training, but the 100 testing environments produce exactly the same actions at every step during testing after every epoch. I'm not sure why, since I have set eps in test_fn().

Trinkle23897 commented 3 years ago

I know the issue.

    test_collector = Collector(policy, test_envs, exploration_noise=True)

Sorry about that, I forgot to change this line in #280, will fix soon.

Trinkle23897 commented 3 years ago

Btw, the example you provided looks great! Are you interested in making a pull request to improve this example?

dzy1997 commented 3 years ago

Well thanks for finding the problem! I mainly followed train_agent() in the example to write my self-play code. I will be glad to contribute to the repo once I finish the current project!

adi-vc commented 5 months ago

I am unable to understand why one needs to freeze the second agent (by setting lr = 0) for self-play.

MultiAgentPolicyManager trains both agents in parallel; hence, if we initialize the networks with identical parameters, each agent should learn while playing against the other. Right?

Instead of freezing, at the end of each generation the networks' parameters could be shared (by averaging or similar) so that both agents restart on the same page.
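
For what it's worth, a minimal sketch of what that averaging could look like (this is my own assumption, not something from this thread or the library; `policy` is the MultiAgentPolicyManager built earlier):

def average_agents(policy):
    """Average the two agents' parameters and load the result into both,
    so the next generation restarts from identical weights."""
    sd0 = policy.policies[0].state_dict()
    sd1 = policy.policies[1].state_dict()
    avg = {k: ((sd0[k] + sd1[k]) / 2).to(sd0[k].dtype) for k in sd0}
    policy.policies[0].load_state_dict(avg)
    policy.policies[1].load_state_dict(avg)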