Open · MrSVF opened this issue 1 year ago
Hey @MrSVF, thanks for raising this issue. We are very sorry, but as of Ray 2.8, MBMPO will be moved into the rllib_contrib repo (outside of RLlib) and will no longer receive support from the team.
See here for more information on our contrib efforts: https://github.com/ray-project/ray/tree/master/rllib_contrib
@sven1977 Thanks for the answer!
What happened + What you expected to happen
This PR https://github.com/ray-project/ray/pull/39654 fixes issues within MBMPO that made it impossible to start training. However, after applying these corrections and training on the CartPole environment, the reward does not grow during MBMPO training the way it does with other algorithms. The training parameters are taken from rllib/tuned_examples/mbmpo/cartpole-mbmpo.yaml. I tried different gym environments, with similar results. For example, the code below using the PPO algorithm reaches an average reward on CartPole of more than 200 in just 10 training iterations:
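For reference, here is a minimal sketch of that PPO comparison (the exact PPO script was not preserved in this issue; the sketch assumes the standard `PPOConfig` API and the stock `CartPole-v1` environment):

```python
# Hedged sketch of the PPO baseline: trains PPO on CartPole-v1 for 10
# iterations and prints the mean episode reward per iteration.
from ray.rllib.algorithms.ppo import PPOConfig

algo_ppo = (
    PPOConfig()
    .rollouts(num_rollout_workers=1)
    .resources(num_gpus=0)
    .framework("torch")
    .environment(env="CartPole-v1")
)
algo = algo_ppo.build()

for i in range(10):
    res = algo.train()
    print("RES:", i, res["episode_reward_mean"])
```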
But if you run the code below with MBMPO (see the reproduction script at the end of this issue), then even after about 100 training iterations, the average reward stays around 10.
A typical reward curve during MBMPO training looks like this (the plot was obtained on the MountainCar environment):
[figure: episode_reward_mean over training iterations; the reward stays low and does not grow]
In addition, the following error appears intermittently:
Versions / Dependencies
- OS: Ubuntu 22.04
- Python: 3.9.18
- Ray: 2.7.0
- PyTorch: 2.0.1
Reproduction script
```python
import gymnasium as gym
import numpy as np

from ray.rllib.algorithms.mbmpo import MBMPOConfig


class my_CartPoleWrapper(gym.Wrapper):
    # The wrapper body was truncated in the original issue; the constructor
    # and reward() below are a minimal assumption. RLlib's MB-MPO examples
    # wrap a plain gym env and expose a vectorized reward() method for the
    # dynamics-model rollouts.
    _max_episode_steps = 500

    def __init__(self, env_config=None):
        super().__init__(gym.make("CartPole-v1"))

    def reward(self, obs, action, obs_next):
        # CartPole grants +1 per step.
        return np.ones_like(action, dtype=np.float32)


algo_mbmpo = (
    MBMPOConfig()
    .training(
        train_batch_size=512,
        inner_adaptation_steps=1,
        maml_optimizer_steps=8,
        num_maml_steps=15,
        gamma=0.99,
        lambda_=1.0,  # `lambda` is a Python keyword; the config kwarg is `lambda_`
        lr=0.001,
        clip_param=0.5,
        kl_target=0.003,
        kl_coeff=0.0000000001,
        inner_lr=1e-3,
        model={
            "fcnet_hiddens": [32, 32],
            "free_log_std": True,
        },
    )
    .rollouts(num_rollout_workers=1)
    .resources(num_gpus=0)
    .framework("torch")
    .environment(env=my_CartPoleWrapper)
)
algo = algo_mbmpo.build()

mean_rews = []
for i in range(1000):
    res = algo.train()
    print("RES:", i, res["episode_reward_max"], res["episode_reward_mean"], res["episode_reward_min"])
    mean_rews.append(res["episode_reward_mean"])
print(mean_rews)
```
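To make the flat reward curve easy to see, the collected `mean_rews` list can be plotted after the loop (matplotlib is not part of the original script; this is just an illustrative addition):

```python
# Visualize the per-iteration mean episode reward gathered above.
# `mean_rews` comes from the reproduction script; matplotlib is an
# extra dependency not used in the original issue.
import matplotlib.pyplot as plt

plt.plot(mean_rews)
plt.xlabel("training iteration")
plt.ylabel("episode_reward_mean")
plt.title("MBMPO on CartPole: mean episode reward")
plt.show()
```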
Issue Severity
High: It blocks me from completing my task.