hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

True rewards remain "zero" in the trajectories in stable-baselines2 for custom environments #1167

Open moizuet opened 1 year ago

moizuet commented 1 year ago

I am using reinforcement learning for mathematical optimization with the PPO2 agent in Google Colab. With my custom environment, the episode rewards remain zero in TensorBoard. Also, when I add a print statement to print the "true_reward" inside the "ppo2.py" file (as shown in the figure), I get nothing but a zero vector.

Due to this, my agent is not learning correctly.

The following things are important to note here:

  1. My environment is giving the agent non-zero rewards (I have checked this thoroughly), but on the agent side the rewards are not being collected (a rough sketch of how this can be double-checked is included after the screenshots).
  2. This happens most of the time but not always; sometimes when I reinstall stable-baselines the whole system works perfectly.
  3. This happens only with my custom environment, not with other OpenAI Gym environments.

[screenshots: the true_reward print statement inside ppo2.py and its all-zero output]
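For reference, point 1 can be double-checked at the VecEnv level, since stable-baselines collects rewards through a vectorized wrapper. A rough sketch of such a check, where `MathOptEnv` is only a placeholder name for the custom environment class (not the actual code from this report):

```python
# Rough sanity check: step the custom env through SB2's DummyVecEnv wrapper
# and confirm that non-zero rewards survive vectorization.
# MathOptEnv is a placeholder for the actual custom environment class.
import numpy as np
from stable_baselines.common.vec_env import DummyVecEnv

vec_env = DummyVecEnv([lambda: MathOptEnv()])
obs = vec_env.reset()
for _ in range(200):
    actions = np.array([vec_env.action_space.sample()])
    obs, rewards, dones, infos = vec_env.step(actions)
    if np.any(rewards != 0):
        print("non-zero reward reached the VecEnv layer:", rewards)
        break
else:
    print("all rewards were zero at the VecEnv layer")
```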

Miffyli commented 1 year ago

Hey. Unfortunately we do not have time to offer custom tech support for custom environments. The library code is tested to function (mostly) correctly, so my knee-jerk reply is that something may be off in your environment. I would recommend two things:

1) Try using stable-baselines3, as it is more maintained.
2) Use the check_env tool to check your environment (see docs here). This is part of SB3.
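For anyone landing here later, a minimal sketch of what that check looks like, assuming the classic Gym API used by SB3 1.x (newer SB3 versions expect Gymnasium, where step returns a 5-tuple); `ToyEnv` is just an illustrative stand-in, not anyone's actual environment:

```python
# Minimal check_env sketch with a toy custom env (illustrative stand-in only).
# Assumes the classic Gym API used by SB3 1.x; newer SB3 expects Gymnasium.
import gym
import numpy as np
from gym import spaces
from stable_baselines3.common.env_checker import check_env


class ToyEnv(gym.Env):
    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)

    def reset(self):
        return np.zeros(3, dtype=np.float32)

    def step(self, action):
        obs = self.observation_space.sample()
        reward = 1.0  # make sure a learning signal actually exists
        done = True
        return obs, reward, done, {}


check_env(ToyEnv(), warn=True)  # raises or warns if the Gym API contract is violated
```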

moizuet commented 1 year ago

I have checked my environment with check_env, but unfortunately I am still getting the same problem.

By the way, I forgot to show the TensorBoard plot, which is shown in the following figure (with a flat horizontal line for the episode reward). I think you are right, @Miffyli; I am starting to consider migrating to stable-baselines3 (at least my next research project will not be in stable-baselines2).

But my code base is very long (spectral normalization, dense connections, a custom "amsgrad" optimizer implementation, and a custom Q-value network method for Soft Actor-Critic to implement the Wolpertinger algorithm), which is the major cause of my hesitation.

[screenshot: TensorBoard episode reward plot, a flat horizontal line]

Miffyli commented 1 year ago

Unfortunately I do not have other tips to give and no time to start digging through custom code to find errors :( . I know this is a very bad, maybe even rude-ish, answer that assumes it is a user error, but there are many places where an env implementation can go wrong and cause confusing behaviour like this. If possible, I would recommend taking an environment where the rewards work as expected and changing it towards your final env step by step.
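One hedged way to read that advice in code: start from a known-good env (CartPole below, under the classic Gym API) and swap in one piece of the custom env at a time, e.g. the reward, so only a single change is reflected in TensorBoard. The reward formula here is just a stand-in:

```python
# Sketch of the "change the env step by step" idea: keep CartPole's dynamics,
# but substitute a custom reward so only one component differs at a time.
import gym


class CustomRewardWrapper(gym.Wrapper):
    def step(self, action):
        obs, _, done, info = self.env.step(action)
        custom_reward = float(obs[0] ** 2)  # stand-in for the real reward logic
        return obs, custom_reward, done, info


env = CustomRewardWrapper(gym.make("CartPole-v1"))
```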

moizuet commented 1 year ago

No problem, I am trying to resolve it and will report the reasons as soon as I find them. By the way, I want to ask a question: if we use stable-baselines3, which uses PyTorch (i.e., eager mode of execution), will training be slow relative to the TensorFlow version of stable-baselines, which uses graph mode (much faster computations)?

Miffyli commented 1 year ago

I think that in SB3 other things become a bottleneck before PyTorch's eager mode is the slowing factor: handling the data, computing returns, etc. take much more time than actually running the network graph. I personally do not know about the performance beyond RL, but AFAIK it is not worth the effort to switch to TF2 just to get a bit of a speed boost.
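As a rough illustration of that point (not from the original discussion; the network size, batch size, and buffer length are arbitrary placeholders), a tiny forward pass can be compared against a pure-Python return/advantage loop of the kind an RL library runs around it:

```python
# Back-of-the-envelope timing sketch: one forward pass of a small MLP vs.
# a pure-Python GAE-style loop over a rollout buffer. Sizes are placeholders.
import time
import numpy as np
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
obs = torch.randn(2048, 8)

t0 = time.perf_counter()
with torch.no_grad():
    net(obs)  # eager-mode forward pass over the whole batch
t_net = time.perf_counter() - t0

rewards = np.random.rand(2048).astype(np.float32)
values = np.random.rand(2049).astype(np.float32)
dones = np.zeros(2048, dtype=np.float32)
gamma, lam = 0.99, 0.95

t0 = time.perf_counter()
adv = np.zeros(2048, dtype=np.float32)
last_gae = 0.0
for t in reversed(range(2048)):  # bookkeeping loop, no network involved
    delta = rewards[t] + gamma * values[t + 1] * (1.0 - dones[t]) - values[t]
    last_gae = delta + gamma * lam * (1.0 - dones[t]) * last_gae
    adv[t] = last_gae
t_gae = time.perf_counter() - t0

print(f"forward pass: {t_net * 1e3:.2f} ms, python GAE loop: {t_gae * 1e3:.2f} ms")
```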

moizuet commented 1 year ago

I think that if the number of CPUs (for parallel rollouts) is much larger than the number of GPU SMs, then data will always be available for training and the GPUs will always be busy, so it may be that eager mode becomes the bottleneck (though I agree it may not be too severe). Thanks a lot!!
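For completeness, the parallel-rollout setup being discussed looks roughly like this in SB3; CartPole and the worker count are placeholders standing in for the actual optimization environment and hardware:

```python
# Rough sketch of parallel rollouts in SB3: several CPU worker processes
# collect transitions while the accelerator runs the PPO updates.
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

if __name__ == "__main__":  # guard required for subprocess-based vec envs
    # 16 CPU workers; CartPole-v1 stands in for the custom optimization env
    vec_env = make_vec_env("CartPole-v1", n_envs=16, vec_env_cls=SubprocVecEnv)
    model = PPO("MlpPolicy", vec_env, device="auto", verbose=1)  # uses GPU if available
    model.learn(total_timesteps=100_000)
```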