ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib][MBMPO] The algorithm does not learn as intended. #40400

Open MrSVF opened 1 year ago

MrSVF commented 1 year ago

What happened + What you expected to happen

PR https://github.com/ray-project/ray/pull/39654 fixes issues within MBMPO that made it impossible to even start training. However, after applying those fixes and training on the CartPole environment, the reward does not grow during MBMPO training the way it does with other algorithms. The training parameters are taken from rllib/tuned_examples/mbmpo/cartpole-mbmpo.yaml. I tried different gym environments; the results were similar. For example, the code below, run with the PPO algorithm instead, reaches an average reward of more than 200 on CartPole in just 10 training iterations:

[screenshot: PPO episode_reward_mean exceeding 200 on CartPole within 10 iterations]
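The exact PPO script is not included in the issue; a minimal sketch of such a baseline under the same Ray 2.7 API might look like this (the settings here are assumptions, not the author's):

    from ray.rllib.algorithms.ppo import PPOConfig

    # Sketch of a PPO baseline on CartPole-v1 for comparison; the issue
    # author's exact script and hyperparameters are not shown, so these
    # settings are assumptions.
    ppo = (
        PPOConfig()
        .environment("CartPole-v1")
        .framework("torch")
        .rollouts(num_rollout_workers=1)
        .resources(num_gpus=0)
        .build()
    )
    for i in range(10):
        res = ppo.train()
        print(i, res["episode_reward_mean"])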

But if you run the code below with MBMPO, then even after about 100 training iterations, the average reward stays around 10:

[screenshot: MBMPO episode_reward_mean around 10 after ~100 iterations]

A typical reward curve during MBMPO training looks like this (this one was obtained on the MountainCar environment): [screenshot: MBMPO reward curve on MountainCar]

In addition, the following error appears intermittently:

[screenshot: intermittent error traceback]

Versions / Dependencies

Versions:
- OS: Ubuntu 22.04
- python: 3.9.18
- ray: 2.7.0
- torch: 2.0.1

Reproduction script

from ray.rllib.algorithms.mbmpo import MBMPOConfig
import numpy as np
import gymnasium as gym

class my_CartPoleWrapper(gym.Wrapper):
    _max_episode_steps = 500

    def __init__(self, env: gym.Env, **kwargs):
        # Note: the passed-in env is ignored; the wrapper always builds CartPole-v1.
        env = gym.make("CartPole-v1", **kwargs)
        gym.Wrapper.__init__(self, env)

    def reward(self, obs, action, obs_next):
        # Vectorized reward over a batch of next observations: 1.0 while the
        # cart position and pole angle stay within the termination thresholds,
        # 0.0 otherwise. The thresholds are forwarded from the wrapped
        # CartPole env via gym.Wrapper attribute passthrough.
        x = obs_next[:, 0]
        theta = obs_next[:, 2]

        rew = 1.0 - (
            (x < -self.x_threshold)
            | (x > self.x_threshold)
            | (theta < -self.theta_threshold_radians)
            | (theta > self.theta_threshold_radians)
        ).astype(np.float32)
        return rew
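
MBMPO rolls out its learned dynamics model on batches of observations, which is presumably why reward() above is vectorized. As a quick sanity check (a sketch; the all-zeros batch is a hypothetical input, well inside both thresholds):

    # Sketch: sanity-check the vectorized reward on a batch of observations.
    env = my_CartPoleWrapper(None)                  # env argument is ignored by __init__
    obs_batch = np.zeros((4, 4), dtype=np.float32)  # hypothetical (batch, obs_dim) input
    print(env.reward(obs_batch, None, obs_batch))   # expected: [1. 1. 1. 1.]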

algo_mbmpo = (
    MBMPOConfig()
    .training(
        train_batch_size=512,
        inner_adaptation_steps=1,
        maml_optimizer_steps=8,
        num_maml_steps=15,
        gamma=0.99,
        lambda_=1.0,  # "lambda" is a Python keyword; the config argument is "lambda_"
        lr=0.001,
        clip_param=0.5,
        kl_target=0.003,
        kl_coeff=0.0000000001,
        inner_lr=1e-3,
        model={
            "fcnet_hiddens": [32, 32],
            "free_log_std": True,
        },
    )
    .rollouts(num_rollout_workers=1)
    .resources(num_gpus=0)
    .framework("torch")
    .environment(env=my_CartPoleWrapper)
)
algo = algo_mbmpo.build()

mean_rews = []
for i in range(1000):
    res = algo.train()
    print(
        "RES:", i,
        res["episode_reward_max"],
        res["episode_reward_mean"],
        res["episode_reward_min"],
    )
    mean_rews.append(res["episode_reward_mean"])
print(mean_rews)
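
To make the flat learning curve easier to see, the collected means can be plotted (a sketch; matplotlib is an assumption, not part of the original script):

    # Sketch: visualize the episode-reward means collected in the loop above.
    import matplotlib.pyplot as plt

    plt.plot(mean_rews)
    plt.xlabel("training iteration")
    plt.ylabel("episode_reward_mean")
    plt.show()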

Issue Severity

High: It blocks me from completing my task.

sven1977 commented 1 year ago

Hey @MrSVF, thanks for raising this issue. We are very sorry, but as of Ray 2.8, MBMPO will be moved into the rllib_contrib repo (outside of RLlib) and will no longer receive support from the team. See here for more information on our contrib efforts: https://github.com/ray-project/ray/tree/master/rllib_contrib

MrSVF commented 1 year ago

@sven1977 Thanks for the answer!