
[RLlib] MBMPO training stuck on 3.0.0dev0 #32997

Open juhannc opened 1 year ago

juhannc commented 1 year ago

What happened + What you expected to happen

When using the MBMPO algorithm, training stalls after the first round of dynamics-ensemble training. Even hours later, nothing happens.

What I expect to happen is for training to continue.
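
Not part of the original report, but for anyone trying to localize the stall: Python's stdlib faulthandler can be armed before training so that a silent hang still leaves a trace of where execution is blocked. A minimal sketch; the 600-second window is an arbitrary assumption:

import faulthandler

# Dump every thread's traceback to stderr after each 600 s window the
# process stays alive, so a silent stall shows where it is blocked.
# Note: this covers only the driver process, not the Ray worker processes.
faulthandler.dump_traceback_later(600, repeat=True)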

Logs:

/home/juhannc/.virtualenvs/ray-test/lib/python3.8/site-packages/tensorflow_probability/python/__init__.py:57: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if (distutils.version.LooseVersion(tf.__version__) <
Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2023-03-03 12:54:49,997 INFO worker.py:1544 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
(pid=2618554) /home/juhannc/.virtualenvs/ray-test/lib/python3.8/site-packages/tensorflow_probability/python/__init__.py:57: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(pid=2618554)   if (distutils.version.LooseVersion(tf.__version__) <
(pid=2618561) /home/juhannc/.virtualenvs/ray-test/lib/python3.8/site-packages/tensorflow_probability/python/__init__.py:57: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(pid=2618561)   if (distutils.version.LooseVersion(tf.__version__) <
(RolloutWorker pid=2618554) 2023-03-03 12:54:56,049     WARNING env.py:156 -- Your env doesn't have a .spec.max_episode_steps attribute. Your horizon will default to infinity, and your environment will not be reset.
(RolloutWorker pid=2618554) 2023-03-03 12:54:56,050     WARNING env.py:166 -- Your env reset() method appears to take 'seed' or 'return_info' arguments. Note that these are not yet supported in RLlib. Seeding will take place using 'env.seed()' and the info dict will not be returned from reset.
2023-03-03 12:54:56,124 WARNING env.py:156 -- Your env doesn't have a .spec.max_episode_steps attribute. Your horizon will default to infinity, and your environment will not be reset.
2023-03-03 12:54:56,124 WARNING env.py:166 -- Your env reset() method appears to take 'seed' or 'return_info' arguments. Note that these are not yet supported in RLlib. Seeding will take place using 'env.seed()' and the info dict will not be returned from reset.
Training Dynamics Ensemble - Epoch #0:Train loss: 1.0040107 1.0491492 1.0370538 1.0175241 0.9896876, Valid Loss: 0.78170776 0.5496151 0.6385575 0.64831156 0.80438256,  Moving Avg Valid Loss: 1.153019 0.81068224 0.94187236 0.9562595 1.1864643
[truncated]
Training Dynamics Ensemble - Epoch #402:Train loss: 8.87619e-06 0.00011130612 9.281924e-05 0.0002172304 0.0001362974, Valid Loss: 0.00020231576 0.00031216117 0.0002982092 0.00041011887 0.0004675415,  Moving Avg Valid Loss: 0.00020067632 0.00031215962 0.000298207 0.0004101001 0.00046753784
Stopping Training of Model 0
2023-03-03 12:55:09,855 WARNING deprecation.py:50 -- DeprecationWarning: `remote_workers()` has been deprecated. Accessing the list of remote workers directly through remote_workers() is strongly discouraged. Please try to use one of the foreach accessors that is fault tolerant.  This will raise an error in the future!
2023-03-03 12:55:09,881 INFO trainable.py:172 -- Trainable.setup took 26.571 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.

Versions / Dependencies

Ray: 3.0.0dev0 (commit: 8b55e2d85301ffae02bd980b9e242e5671bf104c)
Python: 3.8.10
OS: Ubuntu 20.04.5

Reproduction script

from ray.rllib.algorithms.mbmpo import MBMPOConfig
from ray.rllib.examples.env.mbmpo_env import CartPoleWrapper

# Minimal MBMPO setup on the bundled CartPole wrapper.
config = (
    MBMPOConfig()
    .environment(env=CartPoleWrapper)
    .framework("torch")
)

algo = config.build()
algo.train()  # hangs after the first dynamics-ensemble fit
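
To keep the reproduction from blocking a terminal forever, the final algo.train() call above can be replaced by a watchdog variant that treats a missed deadline as a hang. A minimal sketch, not part of the original script; the 600-second budget is an arbitrary assumption:

import threading

result = {}

def run_one_iteration():
    # A single train() call should return with iteration metrics;
    # with this bug it never does.
    result["metrics"] = algo.train()

# daemon=True lets the interpreter exit even if train() never returns.
t = threading.Thread(target=run_one_iteration, daemon=True)
t.start()
t.join(timeout=600)  # assumed budget; pick one comfortably above a normal iteration
if "metrics" not in result:
    print("train() did not return within 600 s -- the run appears stuck")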

Issue Severity

High: It blocks me from completing my task.

ruturajsambhusvt commented 1 year ago

Hello, I wanted to use MBMPO, but I couldn't get the example running. It would be great if you could help!

Logs: mbmpo_log.txt

Versions / Dependencies

Ray: 2.1.0
Python: 3.9.13
OS: Ubuntu 22.04.2

Reproduction script

from ray.tune import register_env
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.algorithms.sac import SACConfig
from ray.rllib.algorithms.mbmpo import MBMPOConfig
from ray.rllib.examples.env.mbmpo_env import CartPoleWrapper
from ray.rllib.examples.env.mbmpo_env import PendulumWrapper
from ray.rllib.evaluation.rollout_worker import RolloutWorker
import os
import time
import wandb
import numpy as np
import tracemalloc
import tensorboard

if __name__ == "__main__":
    agent = (
        MBMPOConfig()
        .environment(env=PendulumWrapper, disable_env_checking=True)
        .rollouts(num_rollout_workers=10, num_envs_per_worker=20)
        .training(
            inner_adaptation_steps=1, maml_optimizer_steps=8, gamma=0.99,
            lambda_=1, lr=0.001, vf_clip_param=0.5, kl_target=0.003,
            kl_coeff=0.0000000001, inner_lr=0.001, num_maml_steps=15,
            model={"fcnet_hiddens": [32, 32], "free_log_std": True},
        )
        .framework("torch")
        .build()
    )

    result = agent.train()
    print(result)
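
Not from the comment above, but stepping the wrapper directly, outside RLlib, can rule the environment out as the source of the stall. A sketch assuming PendulumWrapper takes no constructor arguments and follows the classic Gym API (adjust the reset()/step() unpacking for gymnasium-style signatures):

from ray.rllib.examples.env.mbmpo_env import PendulumWrapper

env = PendulumWrapper()
obs = env.reset()
for _ in range(10):
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)  # classic 4-tuple Gym API assumed
    if done:
        obs = env.reset()
print("environment steps fine; the stall is likely inside the algorithm")
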
lishiqianhugh commented 1 year ago

Same issue for me on Ray 2.3.1. Have you solved it?