nsidn98 / InforMARL

Code for our paper: Scalable Multi-Agent Reinforcement Learning through Intelligent Information Aggregation
https://nsidn98.github.io/InforMARL/
MIT License

A minor question about `env.reset` #3

Closed · DH-O closed this 1 year ago

DH-O commented 1 year ago

Hi @nsidn98, thank you for all your kind responses; they have helped me a lot.

I successfully got training results, but I wonder whether the environment is reset after an episode ends. In my opinion, the environment should be reset at the end of every episode.

According to the code below, which is from graph_mpe_runner.py, `env.reset` is called only once, when `run` is invoked.

I wonder whether it is okay that `env.reset` is called only once for the whole training run, even though 128 parallel rollout environments (`n_rollout_threads = 128`) are running. If so, then only 128 rollouts are ever started from a freshly reset environment.

The questions I would like answered are as follows:

  1. Is `env.reset` called only once for the entire training run?
  2. If so, are only 128 rollouts performed over the whole training run? Also, according to the code below, there is a for loop over `self.episode_length`. Is it okay that `env.reset` is not called after an episode ends?

Thank you!

```python
def run(self):
    self.warmup()

    start = time.time()
    episodes = int(self.num_env_steps) // self.episode_length // self.n_rollout_threads

    # This is where the episodes are actually run.
    for episode in range(episodes):
        if self.use_linear_lr_decay:
            self.trainer.policy.lr_decay(episode, episodes)

        for step in range(self.episode_length):
            # Sample actions
            values, actions, action_log_probs, rnn_states, \
                rnn_states_critic, actions_env = self.collect(step)

            # Obs reward and next obs
            obs, agent_id, node_obs, adj, rewards, \
                dones, infos = self.envs.step(actions_env)

            data = (obs, agent_id, node_obs, adj, agent_id, rewards,
                    dones, infos, values, actions, action_log_probs,
                    rnn_states, rnn_states_critic)

            # insert data into buffer
            self.insert(data)

        # compute return and update network
        self.compute()
        train_infos = self.train()

        # post process
        total_num_steps = (episode + 1) * self.episode_length * self.n_rollout_threads

        # save model
        if episode % self.save_interval == 0 or episode == episodes - 1:
            self.save()

        # log information
        if episode % self.log_interval == 0:
            end = time.time()

            env_infos = self.process_infos(infos)

            avg_ep_rew = np.mean(self.buffer.rewards) * self.episode_length
            train_infos["average_episode_rewards"] = avg_ep_rew
            print(f"Average episode rewards is {avg_ep_rew:.3f} \t"
                  f"Total timesteps: {total_num_steps} \t "
                  f"Percentage complete {total_num_steps / self.num_env_steps * 100:.3f}")
            self.log_train(train_infos, total_num_steps)
            self.log_env(env_infos, total_num_steps)

        # eval
        if episode % self.eval_interval == 0 and self.use_eval:
            self.eval(total_num_steps)
```
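
For concreteness, here is a sketch of the arithmetic the outer loop implies; the specific values below are hypothetical, chosen only to illustrate how the episodes are counted:

```python
# Hypothetical settings, only to illustrate the outer-loop episode count.
num_env_steps = 2_000_000   # total environment steps budgeted for training
episode_length = 25         # max steps per episode
n_rollout_threads = 128     # parallel environments

episodes = int(num_env_steps) // episode_length // n_rollout_threads
print(episodes)  # 625 outer iterations, each collecting 25 * 128 = 3200 env steps
```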
nsidn98 commented 1 year ago

Hi @DH-O,

  1. No, the reset function is called every time an episode ends. We have included a reset inside the step function itself, triggered when all agents are done with their tasks. You can check this line; see the sketch after this list.
  2. We cap the maximum episode length for all agents (e.g., 25 or 50 steps), hence the loop over that range. The number of rollouts simply means running `n_rollout_threads` parallel environments with the same policy network to collect data faster.
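
A minimal sketch of the reset-inside-step pattern from point 1, for anyone reading later. This is illustrative only, not the repo's actual env code: the name `AutoResetWrapper` is hypothetical, and a simplified gym-style return tuple is used for brevity.

```python
import numpy as np

class AutoResetWrapper:
    """Hypothetical wrapper illustrating reset-inside-step:
    when every agent is done, the episode's final observation is
    replaced by the observation of a freshly reset environment."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, actions):
        obs, rewards, dones, infos = self.env.step(actions)
        if np.all(dones):           # all agents finished their tasks
            obs = self.env.reset()  # start the next episode immediately
        return obs, rewards, dones, infos
```

Under this scheme, each of the `n_rollout_threads` parallel environments keeps producing fresh episodes even though the runner calls `reset` only once during warmup; the per-episode cap (e.g., 25 steps) just bounds how long any single episode can last.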

I hope this answers your questions :)

nsidn98 commented 1 year ago

@DH-O Can I close this issue now?

DH-O commented 1 year ago

@nsidn98 Oops, I forgot to close it. Thank you for all your help :)