Or should I formulate the fixed agent as a part of the environment and train without multi-agent settings?
This is a good approach in my opinion. You can switch the network parameters through the trainer's train_fn according to the epoch.
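For example, a rough sketch of what that could look like (my own illustration, reusing the policy/args names from the code further down; SYNC_EVERY is a made-up constant):

SYNC_EVERY = 5  # hypothetical: refresh the frozen opponent every 5 epochs

def train_fn(epoch, env_step):
    policy.policies[0].set_eps(args.eps_train)
    policy.policies[1].set_eps(args.eps_train)
    if epoch % SYNC_EVERY == 0:
        # copy the learner's current weights into the fixed opponent
        policy.policies[1].load_state_dict(policy.policies[0].state_dict())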
test_reward: 1.000000 ± 0.000000
Have you inspected the actual trajectories to see whether they look reasonable? I'm not sure, but in my experience this is usually a bug (e.g., forgetting to set eps).
I have checked my code and eps is set properly, but I am still getting no variance at all, and trying to watch the match gives a "result has no attribute 'act'" error in the collector.
I appended the following code to tic_tac_toe.py
def watch_selfplay(args, agent):
    env = TicTacToeEnv(args.board_size, args.win_size)
    agent.set_eps(args.eps_test)
    policy = MultiAgentPolicyManager([agent, agent])
    policy.eval()
    collector = Collector(policy, env)
    result = collector.collect(n_episode=1, render=args.render)
    rews, lens = result["rews"], result["lens"]
    print(f"Final reward: {rews[:, 0].mean()}, length: {lens.mean()}")
def selfplay(args, num_generation=5):  # always train first agent, start from random policy
    def env_func():
        return TicTacToeEnv(args.board_size, args.win_size)
    train_envs = DummyVectorEnv([env_func for _ in range(args.training_num)])
    test_envs = DummyVectorEnv([env_func for _ in range(args.test_num)])
    # seed
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    train_envs.seed(args.seed)
    test_envs.seed(args.seed)
    # model
    env = TicTacToeEnv(args.board_size, args.win_size)
    args.state_shape = env.observation_space.shape or env.observation_space.n
    args.action_shape = env.action_space.shape or env.action_space.n
    net = Net(args.state_shape, args.action_shape,
              hidden_sizes=args.hidden_sizes, device=args.device).to(args.device)
    optim = torch.optim.Adam(net.parameters(), lr=args.lr)
    agent_learn = DQNPolicy(
        net, optim, args.gamma, args.n_step,
        target_update_freq=args.target_update_freq)
    net_fixed = Net(args.state_shape, args.action_shape,
                    hidden_sizes=args.hidden_sizes, device=args.device).to(args.device)
    optim_fixed = torch.optim.SGD(net_fixed.parameters(), lr=0)
    agent_fixed = DQNPolicy(
        net_fixed, optim_fixed, args.gamma, args.n_step,
        target_update_freq=args.target_update_freq)
    # path = os.path.join(args.logdir, 'tic_tac_toe', 'dqn', 'policy.pth')
    # agent_fixed.load_state_dict(torch.load(path))
    # initialize agents and ma-policy
    agents = [agent_learn, agent_fixed]
    policy = MultiAgentPolicyManager(agents)
    # collector
    train_collector = Collector(
        policy, train_envs,
        VectorReplayBuffer(args.buffer_size, len(train_envs)),
        exploration_noise=True)
    test_collector = Collector(policy, test_envs)
    # policy.set_eps(1)
    train_collector.collect(n_step=args.batch_size * args.training_num)
    # log
    log_path = os.path.join(args.logdir, 'tic_tac_toe', 'dqn')
    writer = SummaryWriter(log_path)
    writer.add_text("args", str(args))
    logger = BasicLogger(writer)

    def save_fn(policy):
        pass

    def stop_fn(mean_rewards):
        return mean_rewards >= args.win_rate

    def train_fn(epoch, env_step):
        policy.policies[0].set_eps(args.eps_train)
        policy.policies[1].set_eps(args.eps_train)

    def test_fn(epoch, env_step):
        policy.policies[0].set_eps(args.eps_test)
        policy.policies[1].set_eps(args.eps_test)

    def reward_metric(rews):
        return rews[:, 0]

    # trainer
    for i_gen in range(num_generation):
        result = offpolicy_trainer(
            policy, train_collector, test_collector, args.epoch,
            args.step_per_epoch, args.step_per_collect, args.test_num,
            args.batch_size, train_fn=train_fn, test_fn=test_fn,
            stop_fn=stop_fn, save_fn=save_fn, update_per_step=args.update_per_step,
            logger=logger, test_in_train=False, reward_metric=reward_metric)
        policy.policies[1].load_state_dict(policy.policies[0].state_dict())
        print('==={} Generations Evolved==='.format(i_gen + 1))

    model_save_path = os.path.join(args.logdir, 'tic_tac_toe', 'dqn', 'policy_selfplay.pth')
    torch.save(policy.policies[0].state_dict(), model_save_path)
    return result, policy.policies[0]
and edited test_tic_tac_toe.py to:
import pprint
from tic_tac_toe import get_args, train_agent, watch, selfplay, watch_selfplay


def test_tic_tac_toe(args=get_args()):
    if args.watch:
        watch(args)
        return
    # result, agent = train_agent(args)
    result, agent = selfplay(args)
    assert result["best_reward"] >= args.win_rate
    if __name__ == '__main__':
        pprint.pprint(result)
        # Let's watch its performance!
        # watch(args, agent)
        watch_selfplay(args, agent)


if __name__ == '__main__':
    test_tic_tac_toe(get_args())
I am trying to convert TicTacToeEnv into a single-agent environment (where the opponent is a random agent and part of the environment). Does that mean I need to inherit the class from gym.Env directly? Since gym.Env does not have action masking, how should I change the code so I can still use the mask, with minimal changes, to fit into test/discrete/test_dqn.py?
Thanks for posting the code! I'll take a look this weekend. Btw, it seems I cannot run watch_selfplay correctly:
Traceback (most recent call last):
  File "test_tic_tac_toe.py", line 22, in <module>
    test_tic_tac_toe(get_args())
  File "test_tic_tac_toe.py", line 18, in test_tic_tac_toe
    watch_selfplay(args, agent)
  File "/home/trinkle/github/tianshou-new/test/multiagent/tic_tac_toe.py", line 195, in watch_selfplay
    result = collector.collect(n_episode=1, render=args.render)
  File "/home/trinkle/github/tianshou-new/tianshou/data/collector.py", line 218, in collect
    act = to_numpy(result.act)
  File "/home/trinkle/github/tianshou-new/tianshou/data/batch.py", line 202, in __getattr__
    return getattr(self.__dict__, key)
AttributeError: 'dict' object has no attribute 'act'
And my log is:
Epoch #1: 5001it [00:05, 846.94it/s, agent_1/loss=0.037, agent_2/loss=0.140, env_step=5000, len=7, n/ep=1, n/st=10, rew=1.00]
Epoch #1: test_reward: 1.000000 ± 0.000000, best_reward: 1.000000 ± 0.000000 in #0
===1 Generations Evolved===
/home/trinkle/github/tianshou-new/tianshou/trainer/offpolicy.py:87: UserWarning: Please consider using save_checkpoint_fn instead of save_fn.
warnings.warn("Please consider using save_checkpoint_fn instead of save_fn.")
Epoch #1: 5001it [00:05, 970.61it/s, agent_1/loss=0.045, agent_2/loss=0.227, env_step=5000, len=13, n/ep=0, n/st=10, rew=0.00]
Epoch #1: test_reward: 1.000000 ± 0.000000, best_reward: 1.000000 ± 0.000000 in #0
===2 Generations Evolved===
Epoch #1: 5001it [00:05, 999.71it/s, agent_1/loss=0.049, agent_2/loss=0.221, env_step=5000, len=11, n/ep=0, n/st=10, rew=1.00]
Epoch #1: test_reward: 1.000000 ± 0.000000, best_reward: 1.000000 ± 0.000000 in #0
===3 Generations Evolved===
Epoch #1: 5001it [00:04, 1040.78it/s, agent_1/loss=0.050, agent_2/loss=0.234, env_step=5000, len=12, n/ep=2, n/st=10, rew=1.00]
Epoch #1: test_reward: -1.000000 ± 0.000000, best_reward: 1.000000 ± 0.000000 in #0
===4 Generations Evolved===
Epoch #1: 5001it [00:04, 1056.89it/s, agent_1/loss=0.056, agent_2/loss=0.182, env_step=5000, len=21, n/ep=0, n/st=10, rew=0.00]
Epoch #1: test_reward: -1.000000 ± 0.000000, best_reward: 1.000000 ± 0.000000 in #0
===5 Generations Evolved===
[1, 1, 1, -1, -1]
Does that mean I need to inherit the class from gym.Env directly? Since gym.Env does not have action masking, how should I change the code so I can still use the mask, with minimal changes, to fit into test/discrete/test_dqn.py?
Yes, inherit from gym.Env, and change the env observation to
{
    "obs": obs,
    "mask": mask,
}
That should work.
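For reference, here is a minimal, self-contained sketch of that idea (my own, hardcoded to a 3x3 board; the class name and helper methods are made up), where the random opponent lives inside the environment and the observation is returned as the {"obs", "mask"} dict:

import gym
import numpy as np


class RandomOpponentTicTacToe(gym.Env):
    """3x3 tic-tac-toe against a random opponent baked into the env."""

    def __init__(self):
        self.action_space = gym.spaces.Discrete(9)
        # state_shape for the network comes from this space; the actual
        # observation returned below is a dict with "obs" and "mask"
        self.observation_space = gym.spaces.Box(-1, 1, shape=(9,), dtype=np.int8)

    def reset(self):
        self.board = np.zeros(9, dtype=np.int8)
        return self._obs()

    def _obs(self):
        # the mask marks empty squares; DQN only picks among masked-in actions
        return {"obs": self.board.copy(), "mask": self.board == 0}

    def _wins(self, player):
        b = self.board.reshape(3, 3)
        lines = list(b) + list(b.T) + [b.diagonal(), np.fliplr(b).diagonal()]
        return any((line == player).all() for line in lines)

    def step(self, action):
        self.board[action] = 1                    # our move
        if self._wins(1):
            return self._obs(), 1.0, True, {}
        empty = np.flatnonzero(self.board == 0)
        if len(empty) == 0:                       # draw
            return self._obs(), 0.0, True, {}
        self.board[np.random.choice(empty)] = -1  # random opponent move
        if self._wins(-1):
            return self._obs(), -1.0, True, {}
        return self._obs(), 0.0, False, {}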
Your log is exactly what I got with the code I posted. I just fixed the problem with watch_selfplay: I need to deep-copy the agent when constructing the multi-agent policy. The function watch_selfplay now looks like this and it works:
from copy import deepcopy  # needed for the fix below

def watch_selfplay(args, agent):
    env = TicTacToeEnv(args.board_size, args.win_size)
    agent.set_eps(args.eps_test)
    policy = MultiAgentPolicyManager([agent, deepcopy(agent)])  # fixed here
    policy.eval()
    collector = Collector(policy, env)
    result = collector.collect(n_episode=1, render=args.render)
    rews, lens = result["rews"], result["lens"]
    print(f"Final reward: {rews[:, 0].mean()}, length: {lens.mean()}")
But I am still getting rewards with no variance. With some debugging, it turned out that the 10 training environments produce different actions during training, while the 100 testing environments produce exactly the same actions at every step during testing after every epoch. Not sure why that happens, since I have set eps in test_fn().
I know the issue: the test collector needs to be created with exploration noise enabled,
test_collector = Collector(policy, test_envs, exploration_noise=True)
Sorry about that, I forgot to change this line in #280, will fix soon.
Btw, your provided example looks great! Are you interested in making a pull request to improve this example?
Well, thanks for finding the problem! I mainly followed train_agent() in the example to write my self-play code. I will be glad to contribute to the repo once I finish my current project!
I am unable to understand why one needs to fix the 2nd agent (by setting lr=0) for self-play. MultiAgentPolicyManager trains both agents in parallel, so if we initialize the networks with identical parameters, each agent should learn while playing against the other, right? Then, at the end of each generation, the networks' information could be shared (by taking an average or something) so they restart on the same page.
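For what it's worth, a rough sketch of that "share the networks at the end of each generation" idea (my own illustration, not something the example does) could look like:

def average_generations(policy):
    """Average both agents' float parameters and load the result into each."""
    sd0 = policy.policies[0].state_dict()
    sd1 = policy.policies[1].state_dict()
    avg = {
        k: (v + sd1[k]) / 2 if v.is_floating_point() else v  # keep integer buffers as-is
        for k, v in sd0.items()
    }
    policy.policies[0].load_state_dict(avg)
    policy.policies[1].load_state_dict(avg)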
I am trying to adapt the tic-tac-toe example to train through self-play. I train an agent against a fixed agent (with the same network architecture) until the average win rate reaches 95%, then I copy the weights from the trained agent to the fixed agent and repeat the process for 10 generations of evolution. My first question is how to fix an agent under multi-agent settings. Currently I am using an optimizer with lr=0 on the agent I want to freeze as an ad hoc solution, but I am not sure it is the correct way, and there should be a solution that does not compute gradients at all. Should I use some existing API (that I don't know about)? Or should I formulate the fixed agent as a part of the environment and train without multi-agent settings? My second problem is that training shows no variance at all (the log at the end of each epoch says
test_reward: 1.000000 ± 0.000000
). Is it because DQN is deterministic at test time and therefore too easy to beat? What should I change to make meaningful training progress, like when training against a random policy?
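For completeness, one possible way to avoid computing gradients for the frozen agent at all (my own sketch, not an official Tianshou API, and assuming MultiAgentPolicyManager simply forwards learn() to each sub-policy) is to give the fixed agent a no-op learn(), so its weights only ever change via load_state_dict:

class FrozenDQNPolicy(DQNPolicy):
    """Hypothetical helper: a DQN policy that never updates itself."""
    def learn(self, batch, **kwargs):
        return {}  # skip the gradient step; weights change only via load_state_dict

agent_fixed = FrozenDQNPolicy(
    net_fixed, torch.optim.SGD(net_fixed.parameters(), lr=0),  # optimizer is never stepped
    args.gamma, args.n_step, target_update_freq=args.target_update_freq)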