Closed: AmmarRashed closed this issue 2 years ago.
I think it probably has to do with the learning rate, especially the critic learning rate.
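For reference, in the ray 1.x-style config dict RLlib's SAC keeps its learning rates under an "optimization" sub-dict; a minimal sketch of lowering the critic learning rate for the SAC-on-CartPole setup discussed in this issue (the values are only illustrative, not a recommendation):

import ray
from ray import tune

ray.init()
# Illustrative sketch only: run SAC on CartPole-v0 with a lower critic (Q-network) learning rate.
# SAC's actor/critic/entropy learning rates sit under the "optimization" key in ray 1.x.
tune.run(
    "SAC",
    config={
        "env": "CartPole-v0",
        "framework": "tf",
        "optimization": {
            "actor_learning_rate": 3e-4,
            "critic_learning_rate": 1e-4,  # illustrative value; the default is 3e-4
            "entropy_learning_rate": 3e-4,
        },
    },
    stop={"timesteps_total": 100000},
)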
Hi @AmmarRashed, I highly recommend asking this type of question on https://discuss.ray.io (e.g. questions about why a certain algorithm may not be learning, or about how to use RLlib for a custom application). The community is more active there and can unblock you faster than if you had submitted an issue here.
I'd always start with the default parameters, and if you do that here it works. I tried your code and the reward indeed goes down. I assume there are some differences in how the parameters are processed.
Orange is your code and red is the code below (this is for Ray 1.13 onwards):
import ray
from ray import tune
from ray.rllib.algorithms.sac import SACConfig, SAC

config = (
    SACConfig()
    .environment(env="CartPole-v0")
    .framework("tf")
)

ray.init()
a = tune.run(
    SAC,
    name="SAC-CartPole",
    config=config.to_dict(),
    stop={
        "timesteps_total": 100000,
        "episode_reward_mean": 150.0,
    },
)
Thanks a lot.
What happened + What you expected to happen
So I have been trying different algorithms (PPO, SAC, etc.) on a custom multi-agent discrete-action environment, but the actor loss was consistently negative (and gets minimized further to large negative numbers), and consequently the reward plummets. I suspected the environment, so I tried the tuned CartPole-v0 example with exactly the same configuration and got the same issue. It seems that the actor loss function needs a sign-flip.
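For context, the standard SAC policy (actor) objective that gets minimized is, in the notation of the original SAC paper,

L_\pi = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi}\left[\alpha \log \pi(a \mid s) - Q_\theta(s, a)\right]

so a negative actor loss is not by itself an error whenever the Q-estimates are positive; the question here is whether RLlib's sign convention matches this objective.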
Versions / Dependencies
Running with the mimoralea/rldm Docker image: Ray 1.6.0, TensorFlow 2.9.0, torch 1.9.0+cu111.
Running on an RTX 3080 GPU, driver > 515.48.07, CUDA version 11.7.
Reproduction script
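A minimal sketch of the setup described above (ray 1.x-style API; not the author's original script):

import ray
from ray import tune

ray.init()
# Sketch: train SAC on CartPole-v0 with the default/tuned-example settings (ray 1.x API).
tune.run(
    "SAC",
    name="SAC-CartPole",
    config={
        "env": "CartPole-v0",
        "framework": "tf",
    },
    stop={
        "timesteps_total": 100000,
        "episode_reward_mean": 150.0,
    },
)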
Issue Severity
High: It blocks me from completing my task.