Open moizuet opened 1 year ago
Hey. Unfortunately we do not have time to offer custom tech support for custom environments. The library code is tested to function (mostly) correctly, so my knee-jerk reply is that something may be off in your environment. I would recommend two things:
1) Try using stable-baselines3, as it is more maintained.
2) Use the `check_env` tool to check your environment (see docs here). This is part of SB3.
I have checked my environment with `check_env`, but unfortunately it is still giving me the error.
By the way, I forgot to show the TensorBoard plot, which is shown in the following figure (with a straight horizontal line for the episode-reward plot). I think @Miffyli, you are right; I am starting to consider migrating to stable-baselines3 (at least my next research project will not be in stable-baselines 2).
But my code base is very long (spectral normalization, dense connections, a custom AMSGrad optimizer implementation, and a custom Q-value network method for Soft Actor-Critic to implement the Wolpertinger algorithm), which is the major cause of my hesitation.
Unfortunately I do not have other tips to give and no time to start digging through custom code to find errors :( . I know this is a very bad, maybe even rude-ish, answer which assumes it is a user error, but there are many parts where an env implementation can go wrong and cause confusing behaviour like this. If possible, I would recommend taking an environment where the rewards work as expected and changing it towards your final env step by step.
No problem, I am trying to resolve it, and I will report the reasons as soon as I find them out. By the way, I want to ask a question: if we use stable-baselines3, which uses PyTorch (i.e., eager mode of execution), will training be slow relative to the TensorFlow version of stable-baselines, which uses graph mode (much faster computation)?
I think in SB3 other things become a bottleneck before PyTorch's eager mode is the slowing-down factor: handling the data, computing returns, etc. takes much more time than actually running the network graph. I personally do not know about performance beyond RL, but AFAIK it is not worth the effort to switch to TF2 just to get a bit of a speed boost.
I think that if the number of CPUs (for parallel rollouts) is much larger than the number of GPU SMs, then data will always be available for training and the GPUs will always be busy, so eager mode may become the bottleneck (though I agree it may not be too severe). Thanks a lot!!
I am using reinforcement learning for mathematical optimization, with the PPO2 agent in Google Colab. With my custom environment, the episode rewards remain zero in TensorBoard. Also, when I use a print statement to print out `true_reward` inside the `ppo2.py` file (as shown in the figure), I get nothing but a zero vector.
Because of this, my agent is not learning correctly.
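One library-independent way to rule out the environment is to roll out random actions and check whether the reward is ever non-zero before the agent sees it. This is a sketch under assumptions: `DummyEnv` is a hypothetical stand-in following the classic Gym API (`step` returns a 4-tuple), to be replaced with the real custom env.

```python
import random

class DummyEnv:
    """Hypothetical stand-in for the custom env (replace with your own)."""

    def __init__(self):
        self._t = 0

    def reset(self):
        self._t = 0
        return [0.0]

    def step(self, action):
        self._t += 1
        reward = float(action)  # a real env computes this from its state
        done = self._t >= 10
        return [0.0], reward, done, {}

def any_nonzero_reward(env, episodes=20):
    # Roll out random actions; if every reward is exactly 0.0,
    # the reward computation (or its wiring into step()) is suspect.
    for _ in range(episodes):
        env.reset()
        done = False
        while not done:
            _, reward, done, _ = env.step(random.choice([0, 1]))
            if reward != 0.0:
                return True
    return False

print(any_nonzero_reward(DummyEnv()))
```

If this already prints `False` for the real env, the problem is inside the env's reward computation rather than in PPO2; if it prints `True`, the reward is being lost somewhere between `step()` and the rollout buffer (e.g. a wrapper overwriting it).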
The following things are important to note here: