ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib] GCS connection failure with MAML on Torch framework #34624

Open simonsays1980 opened 1 year ago

simonsays1980 commented 1 year ago

What happened + What you expected to happen

What happened

Running the example below with the Torch framework resulted, after a long time, in a GCS connection error:

/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(+0xce4daa) [0x7f29956e4daa] ray::operator<<()
/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(+0xce6892) [0x7f29956e6892] ray::SpdLogMessage::Flush()
/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x37) [0x7f29956e6ba7] ray::RayLog::~RayLog()
/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(+0x73b6fd) [0x7f299513b6fd] ray::rpc::GcsRpcClient::CheckChannelStatus()
/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(_ZN5boost4asio6detail12wait_handlerIZN3ray3rpc12GcsRpcClient15SetupCheckTimerEvEUlNS_6system10error_codeEE_NS0_9execution12any_executorIJNS9_12context_as_tIRNS0_17execution_contextEEENS9_6detail8blocking7never_tILi0EEENS9_11prefer_onlyINSG_10possibly_tILi0EEEEENSJ_INSF_16outstanding_work9tracked_tILi0EEEEENSJ_INSN_11untracked_tILi0EEEEENSJ_INSF_12relationship6fork_tILi0EEEEENSJ_INSU_14continuation_tILi0EEEEEEEEE11do_completeEPvPNS1_19scheduler_operationERKS7_m+0x303) [0x7f299513bba3] boost::asio::detail::wait_handler<>::do_complete()
/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(+0xcf5a3b) [0x7f29956f5a3b] boost::asio::detail::scheduler::do_run_one()
/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(+0xcf6c71) [0x7f29956f6c71] boost::asio::detail::scheduler::run()
/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(+0xcf6ee0) [0x7f29956f6ee0] boost::asio::io_context::run()
/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(+0x55f8c8) [0x7f2994f5f8c8] std::thread::_State_impl<>::_M_run()
/home/simon/git-projects/ray-experiments/experiments/ppo_normalized_advantages/.venv/lib/python3.9/site-packages/ray/_raylet.so(+0xe2ac90) [0x7f299582ac90] execute_native_thread_routine
/lib64/libc.so.6(+0x8b12d) [0x7f29a37fe12d] start_thread
/lib64/libc.so.6(+0x10cbc0) [0x7f29a387fbc0] __GI___clone3

Also, over the whole run, which took almost half an hour, no training iteration appeared to happen and the tfevents file remained empty.
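
As a first sanity check (an editorial sketch, not part of the original report), the GCS error can be separated from the MAML hang by confirming that a local Ray cluster starts and its GCS answers at all before launching the experiment:

import ray

# Start (or connect to) a local cluster; if the GCS itself is unhealthy,
# this call or the resource query below would already fail.
ray.init()
print("Ray initialized:", ray.is_initialized())
print("Cluster resources:", ray.cluster_resources())
ray.shutdown()

If this succeeds, the GCS failure is more likely a downstream symptom of the stalled run than an independent cluster problem.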

What you expected to happen

That MAML runs on the test example and trains.

Versions / Dependencies

Python 3.9.12, Ray 2.3.1, Fedora Linux 37

Reproduction script


from ray.rllib.examples.env.pendulum_mass import PendulumMassEnv
from ray.rllib.algorithms.maml.maml import MAMLConfig

from ray.tune import register_env
from ray import air, tune

# Register the custom meta-learning environment under a name RLlib can resolve.
register_env("pendulum_mass", lambda config: PendulumMassEnv())

config = (
    MAMLConfig()
    .environment(
        env="pendulum_mass",
        clip_actions=False,
    )
    .rollouts(
        rollout_fragment_length=200,
        num_rollout_workers=2,
        num_envs_per_worker=10,                    
    )
    .framework(
        framework="torch",
        eager_tracing=False,
    )
    .training(
        inner_adaptation_steps=1,
        maml_optimizer_steps=5,
        gamma=0.99,
        lambda_=1.0,
        lr=0.001,
        vf_loss_coeff=0.5,
        clip_param=0.3,
        kl_target=0.1,
        kl_coeff=0.001,
        inner_lr=0.03,
        model={
            "fcnet_hiddens": [64, 64],
            "free_log_std": True,
        }        
    )
    .exploration(
        explore=True,
    )
    .debugging(
        log_level="DEBUG",
    )
)

# Run the experiment with Tune and stop after 10 training iterations.
tuner = tune.Tuner(
    "MAML",
    param_space=config.to_dict(),
    run_config=air.RunConfig(
        stop={"training_iteration": 10}
    )
)

tuner.fit()
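
A minimal sketch (an assumption, not part of the original report) that bypasses Tune and steps the algorithm directly from the same config object can help narrow down whether the hang sits in Tune or in MAML itself:

# Hypothetical debugging variant: build the algorithm from the config above
# and step it manually instead of going through tune.Tuner.
algo = config.build()
for i in range(3):
    result = algo.train()
    print("iteration", i, "episode_reward_mean:", result.get("episode_reward_mean"))
algo.stop()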

Issue Severity

Medium: It is a significant difficulty but I can work around it.

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public Slack channel.

jjyao commented 8 months ago

@simonsays1980 are you able to reproduce it with the latest Ray version?

simonsays1980 commented 8 months ago

> @simonsays1980 are you able to reproduce it with the latest Ray version?

Hi @jjyao, thanks for coming back to this. I have to retry this; it's been a year. I'll ping you as soon as I have results.