thu-ml / tianshou

An elegant PyTorch deep reinforcement learning library.
https://tianshou.org
MIT License

Collector sampling with multiple environments does not seem to be unbiased with n_episodes #1042

Open · utkarshp opened this issue 9 months ago

utkarshp commented 9 months ago

I made an attempt at fixing this as follows. Essentially, this change makes the envs that have finished wait for those that have not, and then resets all of them at once.

+    def _reset_env_to_next(self, gym_reset_kwargs: Optional[Dict[str, Any]] = None) -> None:
+        gym_reset_kwargs = gym_reset_kwargs if gym_reset_kwargs else {}
+        obs, info = self.env.reset(**gym_reset_kwargs)
+        if self.preprocess_fn:
+            processed_data = self.preprocess_fn(
+                obs=obs, info=info, env_id=np.arange(self.env_num)
+            )
+            obs = processed_data.get("obs", obs)
+            info = processed_data.get("info", info)
+        self.data = Batch(
+            obs={},
+            act={},
+            rew={},
+            terminated={},
+            truncated={},
+            done={},
+            obs_next={},
+            info={},
+            policy={}
+        )
+        self.data.info = info
+        self.data.obs_next = obs
+
     def collect(
             self,
             n_step: Optional[int] = None,
@@ -69,8 +92,10 @@
                     "which may cause extra transitions collected into the buffer."
                 )
             ready_env_ids = np.arange(self.env_num)
+            init_ready_env_ids = np.arange(self.env_num)
         elif n_episode is not None:
             assert n_episode > 0
+            init_ready_env_ids = np.arange(min(self.env_num, n_episode))
             ready_env_ids = np.arange(min(self.env_num, n_episode))
             self.data = self.data[:min(self.env_num, n_episode)]
         else:
@@ -170,24 +195,18 @@
                 episode_lens.append(ep_len[env_ind_local])
                 episode_rews.append(ep_rew[env_ind_local])
                 episode_start_indices.append(ep_idx[env_ind_local])
-                # now we copy obs_next to obs, but since there might be
-                # finished episodes, we have to reset finished envs first.
-                self._reset_env_with_ids(
-                    env_ind_local, env_ind_global, gym_reset_kwargs
-                )
-                for i in env_ind_local:
-                    self._reset_state(i)
+                unfinished_ind_local = np.where(~done)[0]

                 # remove surplus env id from ready_env_ids
                 # to avoid bias in selecting environments
                 if n_episode:
-                    surplus_env_num = len(ready_env_ids) - (n_episode - episode_count)
-                    if surplus_env_num > 0:
-                        mask = np.ones_like(ready_env_ids, dtype=bool)
-                        mask[env_ind_local[:surplus_env_num]] = False
-                        ready_env_ids = ready_env_ids[mask]
-                        self.data = self.data[mask]
-
+                    ready_env_ids = ready_env_ids[unfinished_ind_local]
+                    self.data = self.data[unfinished_ind_local]
+                    if len(unfinished_ind_local) == 0:
+                        self._reset_env_to_next(gym_reset_kwargs)
+                        ready_env_ids = init_ready_env_ids
+                        for i in ready_env_ids:
+                            self._reset_state(i)
             self.data.obs = self.data.obs_next

             if (n_step and step_count >= n_step) or \
MischaPanch commented 9 months ago

Thank you for the detailed description, I'm gonna think about this tomorrow and give a proper answer.

That said, the situation you describe already seems unusual: during training you would usually parallelize different instantiations of the same environment type. Why would some of them be slower than the others? Do you have a specific use case in mind?

utkarshp commented 9 months ago

Actually, I thought I'd give a deterministic example to make the issue clearer, but I now realize that was not necessary and might have caused confusion.

In my specific situation, the episode lengths can vary a lot between episodes. Different environments are not inherently different, but if some envs get stuck with long episodes, then this problem can arise.

I should have started with this example, my apologies. Hope it is clearer now.

maxhuettenrauch commented 9 months ago

Hi @utkarshp, thanks for the issue. We recently had a discussion around this very topic, and currently I believe that if all of your envs are created equal (i.e., episode length only varies due to randomness), then you are simply not collecting enough episodes to capture the variance between short and long episodes of your task. Deliberately stopping the first envs and "handpicking" some of the longer runs might skew the statistics in the same way as collecting too few episodes and ending up with only short ones. The solution would be either to reduce the number of envs so that more episodes are collected per env, or to increase the number of collected episodes. Please let me know if you have other reasons why you would like each env to sample exactly two episodes. We will also perform some testing to see whether there are significant differences when episodes come only from a small subset of the available envs rather than equally from all of them.
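
For what it's worth, here is a rough Monte Carlo sketch (purely illustrative, not tianshou code, with made-up episode lengths) of the kind of selection effect under discussion: if collection simply stopped as soon as the first n_episode episodes finished, short episodes would be over-represented relative to the true episode-length distribution.

import numpy as np

rng = np.random.default_rng(0)

def sample_length():
    # illustrative mixture: 90% of episodes take 100 steps, 10% take 1000 steps
    return 100 if rng.random() < 0.9 else 1000

def first_to_finish_mean(n_envs=50, n_episode=100):
    # keep the first n_episode episodes to finish across n_envs parallel envs
    end_times = []
    for _ in range(n_envs):
        t = 0
        for _ in range(10):  # each env could run more episodes than needed
            length = sample_length()
            t += length
            end_times.append((t, length))
    end_times.sort()  # episodes that finish earliest get collected first
    kept = [length for _, length in end_times[:n_episode]]
    return float(np.mean(kept))

print("true mean episode length:", 0.9 * 100 + 0.1 * 1000)  # 190
print("mean length of collected episodes:",
      np.mean([first_to_finish_mean() for _ in range(20)]))  # noticeably lower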

utkarshp commented 9 months ago

The only reason is time. When I am training, I am collecting lots of episodes from each env, and this is not such an issue. When testing, I want to be able to test on the same test set each time I test. Furthermore, I have a baseline algorithm that also needs to run on the same test set so that I can compare it with the RL agent's performance. I have 100 pre-selected episodes (=seeds) that I want to run all tests on, and my foremost requirement is that each of these episodes be evaluated once so that a fair comparison can be made with the baseline algorithm, and past iterations. Of course I plan on trying with different test sets for robustness, but currently I am only looking for reproducible behavior on a fixed test set. Since there is no good way (that I know of) to remove envs that are done with their episodes from the rotation, I elected to divide the episodes equally into different envs, hoping that it would work. Is there a better way to achieve this?
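
For concreteness, here is a rough sketch of the kind of setup I mean (not tianshou functionality; make_my_env and FixedSeedWrapper are hypothetical placeholders): the 100 seeds are divided up front among 50 workers, and each worker replays only its assigned seeds on reset.

import gymnasium as gym
from tianshou.env import SubprocVectorEnv

def make_my_env():
    return gym.make("CartPole-v1")  # stand-in for the actual task

class FixedSeedWrapper(gym.Wrapper):
    """Cycle through a pre-assigned list of seeds on every reset."""

    def __init__(self, env, seeds):
        super().__init__(env)
        self._seeds = list(seeds)
        self._idx = 0

    def reset(self, *, seed=None, options=None):
        # ignore whatever seed the collector passes and use the next assigned one
        assigned = self._seeds[self._idx % len(self._seeds)]
        self._idx += 1
        return self.env.reset(seed=assigned, options=options)

all_seeds = list(range(100))  # the 100 fixed test episodes (seeds)
n_envs = 50                   # 2 seeds per env
test_envs = SubprocVectorEnv([
    lambda s=tuple(all_seeds[i::n_envs]): FixedSeedWrapper(make_my_env(), s)
    for i in range(n_envs)
])

The catch is that once a worker has replayed its assigned seeds, further resets just wrap around and produce extra episodes that I don't actually want.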

Anyway, each episode takes around 10-15 minutes. If I parallelize into 50 threads, I am done in about 30 minutes for each test. Otherwise, it takes longer. I have tried with 20-25 envs (4-5 episodes each), and this is still an issue. I suppose I could also try 10 envs with 10 episodes each and reduce my testing frequency.

All in all, I still think that enforcing an equal number of episodes per env would be desirable behavior, but now I think perhaps this could be a feature request.

utkarshp commented 9 months ago

A bit unrelated perhaps, but I guess my problem could also be solved through some sort of interprocess communication between my test envs. Currently I pass episode start seeds at test env init so that each env can execute the episodes it is assigned. Once an env is done with its two episodes, the next reset and step produce garbage that I don't want. If the processes could communicate, then the faster envs could take over the load from the slower envs and things could still be made to work. While this seems more complicated and prone to errors, it may be worth exploring in case there is a sufficient speedup.
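
Something like the following is roughly what I have in mind (again just a sketch, not tianshou functionality; make_my_env is a placeholder, and whether the manager queue proxy can be shared across SubprocVectorEnv workers like this is an assumption): instead of assigning seeds per env up front, every worker pulls the next unclaimed seed from a shared queue at reset time, so faster envs naturally take over work from slower ones.

import multiprocessing as mp
import queue

import gymnasium as gym
from tianshou.env import SubprocVectorEnv

def make_my_env():
    return gym.make("CartPole-v1")  # stand-in for the actual task

class SeedQueueWrapper(gym.Wrapper):
    """Pull the next test seed from a shared queue on every reset."""

    def __init__(self, env, seed_queue):
        super().__init__(env)
        self._queue = seed_queue

    def reset(self, *, seed=None, options=None):
        try:
            next_seed = self._queue.get_nowait()  # grab the next unclaimed test seed
        except queue.Empty:
            next_seed = None                      # all test seeds taken: plain reset
        return self.env.reset(seed=next_seed, options=options)

if __name__ == "__main__":
    manager = mp.Manager()
    seed_queue = manager.Queue()
    for s in range(100):                          # the 100 fixed test seeds
        seed_queue.put(s)
    test_envs = SubprocVectorEnv(
        [lambda: SeedQueueWrapper(make_my_env(), seed_queue) for _ in range(50)]
    )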

bordeauxred commented 9 months ago

Hi @utkarshp, thanks for raising this issue and for your proposed solution. Based on it, I will open a PR addressing the request later on.

Using sequential execution, where each env has to terminate in each batch, is, as you rightly point out, unbiased. It might be computationally quite undesirable though. Consider, for example, an env that takes t steps in 90% of cases and T steps otherwise. For 10 test envs with 10 runs each, the batch-sequential run time is on the order of 10T, whereas there is a good chance of roughly T + 9t in the asynchronous case.
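
A quick back-of-the-envelope Monte Carlo (illustrative only, with made-up numbers t=100, T=1000) comparing the two scheduling modes:

import numpy as np

rng = np.random.default_rng(0)
t, T = 100, 1000                  # short vs. long episode length (in steps)
n_envs, runs_per_env = 10, 10

def episode_lengths(shape):
    return np.where(rng.random(shape) < 0.9, t, T)

def batch_sequential_time():
    # every env must finish the current episode before any env starts the next,
    # so the wall time of each round is the slowest env in that round
    rounds = episode_lengths((runs_per_env, n_envs))
    return rounds.max(axis=1).sum()

def asynchronous_time():
    # each env works through its own runs back to back;
    # wall time is the total of the slowest env
    per_env = episode_lengths((n_envs, runs_per_env)).sum(axis=1)
    return per_env.max()

print("batch sequential:", np.mean([batch_sequential_time() for _ in range(1000)]))
print("asynchronous:   ", np.mean([asynchronous_time() for _ in range(1000)]))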

As your use case is reproducibility at test time rather than performance, the batch-sequential fashion looks alright though.

On a related note, do you find substantially different behaviour when running each of the 100 test env seeds one time vs. performing 100 runs in one test env (passing the seed only for the first reset, which fixes the sequence of random numbers/starting states for the subsequent episodes as well)?

utkarshp commented 9 months ago

Yes, I agree with the computational toll this would take. Perhaps the collector could be given an additional non-default argument (unbiased=False) that can be set to True to ensure this behavior?

I am not entirely sure I fully understand your question, but I will try to answer. With the current implementation, yes I do find substantially different behavior in the following two cases:

  1. Pass 100 episode start states to a single test env
  2. Split the 100 start states into sequences of, say, two each and pass them to 50 test envs.

In the second case, I find that some of the envs finish their allotted episodes and start on new ones. So I get some unwanted episodes, but miss out on those that I wanted.

Please let me know if this answers your question.

bordeauxred commented 9 months ago

Agreed, it will be a non-default argument.

Yes, I see how you would get different results in the two cases you describe.

What I meant is slightly different; let me try to be clearer.

By passing start states, you mean passing seeds?

Could you please elaborate a bit on the environment you are using (is it deterministic or stochastic, does the seed affect only the initial state distribution or something else as well, and how different are the starting states drawn from the initial state distribution)?

The two cases I meant:

  1. Having one test env that is reset with a seed in the first call; env.reset() in the next (99) calls is invoked without an explicit seed argument, so you get a different initial state (and possibly something else in your task) each time.
  2. Calling env.reset(seed) for 100 different seed values (parallel or sequential, doesn't matter).
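
In code, using the standard gymnasium reset API (CartPole is just a stand-in environment), the two cases would look roughly like this:

import gymnasium as gym

env = gym.make("CartPole-v1")  # stand-in environment

# Case 1: seed only the first reset; later resets continue the same RNG stream.
obs, info = env.reset(seed=42)
for _ in range(99):
    ...                        # run one episode to termination, then:
    obs, info = env.reset()    # no explicit seed: different but reproducible starts

# Case 2: explicitly reseed every episode with its own seed.
for s in range(100):
    obs, info = env.reset(seed=s)
    ...                        # run one episode to termination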

For a sufficiently large number of runs, both approaches should converge to the same evaluation statistics.

utkarshp commented 9 months ago

My env is deterministic, so taking the same action in the same state always leads to the same reward and state transition. Passing 100 start states would therefore be the same as passing 100 seeds and reseeding with the next seed at each reset.

I agree that the situations you described should converge to substantially the same evaluation statistic. The problem in my case is that for any policy running on a set of episodes, the rewards tend to have a really large standard deviation. Comparing different policies on different sets of episodes is then a challenge since their confidence intervals tend to overlap. However, if I can show that for the same set of episodes, one policy is better than the other in terms of mean rewards, and that this holds for multiple sets of episodes, then I have a concrete result about which policy is better.
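
As a toy illustration with made-up numbers: if per-episode variation dominates, the unpaired means are nearly indistinguishable, while the per-episode (paired) difference on a shared test set stands out clearly.

import numpy as np

rng = np.random.default_rng(0)
n_episodes = 100

# per-episode "difficulty" dominates the reward variance for both policies
difficulty = rng.normal(0.0, 100.0, n_episodes)
rewards_a = difficulty + rng.normal(5.0, 1.0, n_episodes)  # policy A: +5 on average
rewards_b = difficulty + rng.normal(0.0, 1.0, n_episodes)  # policy B: baseline

# Unpaired view: huge, heavily overlapping spreads.
print("policy A: %.1f +/- %.1f" % (rewards_a.mean(), rewards_a.std()))
print("policy B: %.1f +/- %.1f" % (rewards_b.mean(), rewards_b.std()))

# Paired view on the same episodes: the policy effect is clear.
diff = rewards_a - rewards_b
print("per-episode difference: %.1f +/- %.1f" % (diff.mean(), diff.std()))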

Does this make sense? Please do let me know if you think there is a flaw in my reasoning.

MischaPanch commented 8 months ago

We'll have to do #1058 first. The Collector is too convoluted to be touched directly, so we'll improve the situation in multiple steps.