ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib] GCS connection failure with MAML on Torch framework #34624

Open simonsays1980 opened 1 year ago

simonsays1980 commented 1 year ago

What happened + What you expected to happen

What happened

Running the example below with the Torch framework resulted, after a long time, in a GCS connection error:

/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(+0xce4daa) [0x7f29956e4daa] ray::operator<<()
/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(+0xce6892) [0x7f29956e6892] ray::SpdLogMessage::Flush()
/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x37) [0x7f29956e6ba7] ray::RayLog::~RayLog()
/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(+0x73b6fd) [0x7f299513b6fd] ray::rpc::GcsRpcClient::CheckChannelStatus()
/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(_ZN5boost4asio6detail12wait_handlerIZN3ray3rpc12GcsRpcClient15SetupCheckTimerEvEUlNS_6system10error_codeEE_NS0_9execution12any_executorIJNS9_12context_as_tIRNS0_17execution_contextEEENS9_6detail8blocking7never_tILi0EEENS9_11prefer_onlyINSG_10possibly_tILi0EEEEENSJ_INSF_16outstanding_work9tracked_tILi0EEEEENSJ_INSN_11untracked_tILi0EEEEENSJ_INSF_12relationship6fork_tILi0EEEEENSJ_INSU_14continuation_tILi0EEEEEEEEE11do_completeEPvPNS1_19scheduler_operationERKS7_m+0x303) [0x7f299513bba3] boost::asio::detail::wait_handler<>::do_complete()
/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(+0xcf5a3b) [0x7f29956f5a3b] boost::asio::detail::scheduler::do_run_one()
/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(+0xcf6c71) [0x7f29956f6c71] boost::asio::detail::scheduler::run()
/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(+0xcf6ee0) [0x7f29956f6ee0] boost::asio::io_context::run()
/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(+0x55f8c8) [0x7f2994f5f8c8] std::thread::_State_impl<>::_M_run()
/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(+0xe2ac90) [0x7f299582ac90] execute_native_thread_routine
/lib64/libc.so.6(+0x8b12d) [0x7f29a37fe12d] start_thread
/lib64/libc.so.6(+0x10cbc0) [0x7f29a387fbc0] __GI___clone3

Also, over the whole run, which took almost half an hour, no training iteration appeared to happen and the tfevents file remained empty.
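
As a first sanity check (an editorial sketch, not part of the original report), the GCS error can be separated from the MAML hang by confirming that a local Ray cluster starts and its GCS answers at all before launching the experiment:

import ray

# Start (or connect to) a local cluster; if the GCS itself is unhealthy,
# this call or the resource query below would already fail.
ray.init()
print("Ray initialized:", ray.is_initialized())
print("Cluster resources:", ray.cluster_resources())
ray.shutdown()

If this succeeds, the GCS failure is more likely a downstream symptom of the stalled run than an independent cluster problem.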

What you expected to happen

That MAML runs on the test example and trains.

Versions / Dependencies

Python 3.9.12, Ray 2.3.1, Fedora Linux 37

Reproduction script


from ray.rllib.examples.env.pendulum_mass import PendulumMassEnv
from ray.rllib.algorithms.maml.maml import MAMLConfig

from ray.tune import register_env
from ray import air, tune

# Register the custom meta-learning environment under a name RLlib can resolve.
register_env("pendulum_mass", lambda config: PendulumMassEnv())

config = (
    MAMLConfig()
    .environment(
        env="pendulum_mass",
        clip_actions=False,
    )
    .rollouts(
        rollout_fragment_length=200,
        num_rollout_workers=2,
        num_envs_per_worker=10,                    
    )
    .framework(
        framework="torch",
        eager_tracing=False,
    )
    .training(
        inner_adaptation_steps=1,
        maml_optimizer_steps=5,
        gamma=0.99,
        lambda_=1.0,
        lr=0.001,
        vf_loss_coeff=0.5,
        clip_param=0.3,
        kl_target=0.1,
        kl_coeff=0.001,
        inner_lr=0.03,
        model={
            "fcnet_hiddens": [64, 64],
            "free_log_std": True,
        }        
    )
    .exploration(
        explore=True,
    )
    .debugging(
        log_level="DEBUG",
    )
)

# Run the experiment with Tune and stop after 10 training iterations.
tuner = tune.Tuner(
    "MAML",
    param_space=config.to_dict(),
    run_config=air.RunConfig(
        stop={"training_iteration": 10}
    )
)

tuner.fit()
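
A minimal sketch (an assumption, not part of the original report) that bypasses Tune and steps the algorithm directly from the same config object can help narrow down whether the hang sits in Tune or in MAML itself:

# Hypothetical debugging variant: build the algorithm from the config above
# and step it manually instead of going through tune.Tuner.
algo = config.build()
for i in range(3):
    result = algo.train()
    print("iteration", i, "episode_reward_mean:", result.get("episode_reward_mean"))
algo.stop()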

Issue Severity

Medium: It is a significant difficulty but I can work around it.

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public Slack channel.

jjyao commented 8 months ago

@simonsays1980 are you able to reproduce it with the latest Ray version?

simonsays1980 commented 8 months ago

> @simonsays1980 are you able to reproduce it with the latest Ray version?

Hi @jjyao, thanks for coming back to this. I have to retry this; it's been a year. I'll ping you as soon as I have results.