Closed: AmmarRashed closed this issue 2 years ago.
I think it probably has to do with the learning rate, especially the critic learning rate.
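For reference, in the ray 1.x-style config dict RLlib's SAC keeps its learning rates under an "optimization" sub-dict; a minimal sketch of lowering the critic learning rate for the SAC-on-CartPole setup discussed in this issue (the values are only illustrative, not a recommendation):

import ray
from ray import tune

ray.init()
# Illustrative sketch only: run SAC on CartPole-v0 with a lower critic (Q-network) learning rate.
# SAC's actor/critic/entropy learning rates sit under the "optimization" key in ray 1.x.
tune.run(
    "SAC",
    config={
        "env": "CartPole-v0",
        "framework": "tf",
        "optimization": {
            "actor_learning_rate": 3e-4,
            "critic_learning_rate": 1e-4,  # illustrative value; the default is 3e-4
            "entropy_learning_rate": 3e-4,
        },
    },
    stop={"timesteps_total": 100000},
)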
Hi @AmmarRashed, I highly recommend asking this type of question on https://discuss.ray.io (e.g. questions about why a certain algorithm may not be learning, or about how to use RLlib for a custom application). The community is more active there and can unblock you faster than if you had submitted an issue here.
I'd always start with the default parameters, and if you do that here it works. I tried your code and the reward indeed goes down. I assume there are some differences in how the parameters are processed.
Orange is your code and red is the code below (this is for Ray 1.13 onwards):
import ray
from ray import tune
from ray.rllib.algorithms.sac import SACConfig, SAC

config = (
    SACConfig()
    .environment(env="CartPole-v0")
    .framework("tf")
)

ray.init()
a = tune.run(
    SAC,
    name="SAC-CartPole",
    config=config.to_dict(),
    stop={
        "timesteps_total": 100000,
        "episode_reward_mean": 150.0,
    },
)
Thanks a lot.
What happened + What you expected to happen
So I have been trying different algorithms (PPO, SAC, etc.) on a custom multi-agent discrete-action environment, but the actor loss was consistently negative (and gets minimized further to large negative numbers), and consequently the reward plummets. I suspected the environment, so I tried the tuned CartPole-v0 example with exactly the same configuration and got the same issue. It seems that the actor loss function needs a sign-flip.
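For context, the standard SAC policy (actor) objective that gets minimized is, in the notation of the original SAC paper,

L_\pi = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi}\left[\alpha \log \pi(a \mid s) - Q_\theta(s, a)\right]

so a negative actor loss is not by itself an error whenever the Q-estimates are positive; the question here is whether RLlib's sign convention matches this objective.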
Versions / Dependencies
Running with the mimoralea/rldm Docker image: Ray 1.6.0, TensorFlow 2.9.0, torch 1.9.0+cu111.
Running on an RTX 3080 GPU, driver > 515.48.07, CUDA version 11.7.
Reproduction script
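A minimal sketch of the setup described above (ray 1.x-style API; not the author's original script):

import ray
from ray import tune

ray.init()
# Sketch: train SAC on CartPole-v0 with the default/tuned-example settings (ray 1.x API).
tune.run(
    "SAC",
    name="SAC-CartPole",
    config={
        "env": "CartPole-v0",
        "framework": "tf",
    },
    stop={
        "timesteps_total": 100000,
        "episode_reward_mean": 150.0,
    },
)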
Issue Severity
High: It blocks me from completing my task.