Open · MrSVF opened this issue 1 year ago
Hey @MrSVF, thanks for raising this issue. We are very sorry, but as of Ray 2.8, MBMPO will be moved into the rllib_contrib repo (outside of RLlib) and will no longer receive support from the team.
See here for more information on our contrib efforts: https://github.com/ray-project/ray/tree/master/rllib_contrib
@sven1977 Thanks for the answer!
What happened + What you expected to happen
This PR https://github.com/ray-project/ray/pull/39654 fixes issues within MBMPO that made it impossible to start training. However, after applying these corrections and training on the CartPole environment, the reward does not grow during MBMPO training the way it does with other algorithms. The training parameters are taken from rllib/tuned_examples/mbmpo/cartpole-mbmpo.yaml. I tried different gym environments, with similar results. For example, the code below using the PPO algorithm reaches an average reward on CartPole of more than 200 in just 10 training iterations:
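For reference, here is a minimal sketch of that PPO comparison (the exact PPO script was not preserved in this issue; the sketch assumes the standard `PPOConfig` API and the stock `CartPole-v1` environment):

```python
# Hedged sketch of the PPO baseline: trains PPO on CartPole-v1 for 10
# iterations and prints the mean episode reward per iteration.
from ray.rllib.algorithms.ppo import PPOConfig

algo_ppo = (
    PPOConfig()
    .rollouts(num_rollout_workers=1)
    .resources(num_gpus=0)
    .framework("torch")
    .environment(env="CartPole-v1")
)
algo = algo_ppo.build()

for i in range(10):
    res = algo.train()
    print("RES:", i, res["episode_reward_mean"])
```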
But if you run the code below with MBMPO (see the reproduction script at the end of this issue), then even after about 100 training iterations, the average reward stays around 10.
A typical reward curve during MBMPO training looks like this (the plot was obtained on the MountainCar environment):
[figure: episode_reward_mean over training iterations; the reward stays low and does not grow]
In addition, the following error appears intermittently:
Versions / Dependencies
- OS: Ubuntu 22.04
- Python: 3.9.18
- Ray: 2.7.0
- PyTorch: 2.0.1
Reproduction script
```python
import gymnasium as gym
import numpy as np

from ray.rllib.algorithms.mbmpo import MBMPOConfig


class my_CartPoleWrapper(gym.Wrapper):
    # The wrapper body was truncated in the original issue; the constructor
    # and reward() below are a minimal assumption. RLlib's MB-MPO examples
    # wrap a plain gym env and expose a vectorized reward() method for the
    # dynamics-model rollouts.
    _max_episode_steps = 500

    def __init__(self, env_config=None):
        super().__init__(gym.make("CartPole-v1"))

    def reward(self, obs, action, obs_next):
        # CartPole grants +1 per step.
        return np.ones_like(action, dtype=np.float32)


algo_mbmpo = (
    MBMPOConfig()
    .training(
        train_batch_size=512,
        inner_adaptation_steps=1,
        maml_optimizer_steps=8,
        num_maml_steps=15,
        gamma=0.99,
        lambda_=1.0,  # `lambda` is a Python keyword; the config kwarg is `lambda_`
        lr=0.001,
        clip_param=0.5,
        kl_target=0.003,
        kl_coeff=0.0000000001,
        inner_lr=1e-3,
        model={
            "fcnet_hiddens": [32, 32],
            "free_log_std": True,
        },
    )
    .rollouts(num_rollout_workers=1)
    .resources(num_gpus=0)
    .framework("torch")
    .environment(env=my_CartPoleWrapper)
)
algo = algo_mbmpo.build()

mean_rews = []
for i in range(1000):
    res = algo.train()
    print("RES:", i, res["episode_reward_max"], res["episode_reward_mean"], res["episode_reward_min"])
    mean_rews.append(res["episode_reward_mean"])
print(mean_rews)
```
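To make the flat reward curve easy to see, the collected `mean_rews` list can be plotted after the loop (matplotlib is not part of the original script; this is just an illustrative addition):

```python
# Visualize the per-iteration mean episode reward gathered above.
# `mean_rews` comes from the reproduction script; matplotlib is an
# extra dependency not used in the original issue.
import matplotlib.pyplot as plt

plt.plot(mean_rews)
plt.xlabel("training iteration")
plt.ylabel("episode_reward_mean")
plt.title("MBMPO on CartPole: mean episode reward")
plt.show()
```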
Issue Severity
High: It blocks me from completing my task.