[RLlib + Tune] PermissionError: [WinError 5] Access is denied: '../.tmp_generator' -> '..basic-variant-state-..' while training with ``Tuner``

davidhozic commented 8 months ago

What happened + What you expected to happen

While training multiple agents on an environment, the tuner crashes after a certain time because it cannot replace the basic-variant-state file under ray_results. I'm using PPO on a custom environment and training with tune.Tuner.

I tried running this multiple times and it always seems to crash, sometimes quickly, sometimes after a few hours. It's quite annoying, as I cannot leave my computer unattended while this complex environment trains.

Traceback:

2024-03-05 12:38:11,750 WARNING tune_controller.py:743 -- Trial controller checkpointing failed: [WinError 5] Access is denied: 'C:/Users/David/ray_results/PPO_2024-03-05_11-37-47\\.tmp_generator' -> 'C:/Users/David/ray_results/PPO_2024-03-05_11-37-47\\basic-variant-state-2024-03-05_11-37-47.json'
(RolloutWorker pid=31372) Stopping simulation...
(RolloutWorker pid=31372) ...

Traceback (most recent call last):
  File "C:\development\GIT\FuzbAI\main_rlib.py", line 111, in <module>
    results = tuner.fit()
  File "C:\development\GIT\FuzbAI\venv\lib\site-packages\ray\tune\tuner.py", line 381, in fit
    return self._local_tuner.fit()
  File "C:\development\GIT\FuzbAI\venv\lib\site-packages\ray\tune\impl\tuner_internal.py", line 509, in fit
    analysis = self._fit_internal(trainable, param_space)
  File "C:\development\GIT\FuzbAI\venv\lib\site-packages\ray\tune\impl\tuner_internal.py", line 628, in _fit_internal
    analysis = run(
  File "C:\development\GIT\FuzbAI\venv\lib\site-packages\ray\tune\tune.py", line 1002, in run
    runner.step()
  File "C:\development\GIT\FuzbAI\venv\lib\site-packages\ray\tune\execution\tune_controller.py", line 744, in step
    raise e
  File "C:\development\GIT\FuzbAI\venv\lib\site-packages\ray\tune\execution\tune_controller.py", line 741, in step
    self.checkpoint()
  File "C:\development\GIT\FuzbAI\venv\lib\site-packages\ray\tune\execution\tune_controller.py", line 478, in checkpoint
    self._checkpoint_manager.checkpoint(
  File "C:\development\GIT\FuzbAI\venv\lib\site-packages\ray\tune\execution\experiment_state.py", line 224, in checkpoint
    save_fn()
  File "C:\development\GIT\FuzbAI\venv\lib\site-packages\ray\tune\execution\tune_controller.py", line 377, in save_to_dir
    self._search_alg.save_to_dir(experiment_dir, session_str=self._session_str)
  File "C:\development\GIT\FuzbAI\venv\lib\site-packages\ray\tune\search\basic_variant.py", line 404, in save_to_dir
    _atomic_save(
  File "C:\development\GIT\FuzbAI\venv\lib\site-packages\ray\tune\utils\util.py", line 416, in _atomic_save
    os.replace(tmp_search_ckpt_path, os.path.join(checkpoint_dir, file_name))
PermissionError: [WinError 5] Access is denied: 'C:/Users/David/ray_results/PPO_2024-03-05_11-37-47\\.tmp_generator' -> 'C:/Users/David/ray_results/PPO_2024-03-05_11-37-47\\basic-variant-state-2024-03-05_11-37-47.json'

Versions / Dependencies

Python version: Python 3.10.11 (virtual environment)
OS: Windows 11 x64

Dependencies:

name	version
aiosignal	1.3.1
annotated-types	0.6.0
anyio	4.2.0
attrs	23.2.0
certifi	2024.2.2
charset-normalizer	3.3.2
click	8.1.7
cloudpickle	3.0.0
colorama	0.4.6
dm-tree	0.1.8
exceptiongroup	1.2.0
Farama-Notifications	0.0.4
fastapi	0.109.2
filelock	3.13.1
frozenlist	1.4.1
fsspec	2024.2.0
gymnasium	0.28.1
h11	0.14.0
idna	3.6
imageio	2.34.0
jax-jumpy	1.0.0
Jinja2	3.1.2
jsonschema	4.21.1
jsonschema-specifications	2023.12.1
lazy_loader	0.3
lz4	4.3.3
markdown-it-py	3.0.0
MarkupSafe	2.1.3
mdurl	0.1.2
mpmath	1.3.0
msgpack	1.0.8
networkx	3.2.1
numpy	1.26.4
packaging	23.2
pandas	2.2.1
pillow	10.2.0
pip	23.0.1
protobuf	4.25.3
pyarrow	6.0.1
pybullet	3.2.6
pydantic	2.6.1
pydantic_core	2.16.2
Pygments	2.17.2
python-dateutil	2.9.0.post0
pytz	2024.1
PyYAML	6.0.1
ray	2.9.3
referencing	0.33.0
requests	2.31.0
rich	13.7.1
rpds-py	0.18.0
scikit-image	0.22.0
scipy	1.12.0
setuptools	65.5.0
six	1.16.0
sniffio	1.3.0
starlette	0.36.3
sympy	1.12
tensorboardX	2.6.2.2
tifffile	2024.2.12
torch	2.2.1+cu118
typer	0.9.0
typing_extensions	4.9.0
tzdata	2024.1
urllib3	2.2.0
uvicorn	0.27.0.post1

Reproduction script

from ray.tune.registry import register_env
from ray.rllib.algorithms.ppo import PPOConfig, PPO, PPOTorchPolicy
from ray.rllib.policy.policy import PolicySpec

from ray.tune.logger import pretty_print
from ray.tune import Tuner
from ray.train import RunConfig, CheckpointConfig

from gymnasium.spaces import Box

from FuzbAISim import FuzbAISim
from selfplaycallback import SelfPlayCallBack

import numpy as np

# Register environment to be seen by Ray clusters
register_env("fuzbai", lambda env_config: FuzbAISim())

policies = {}
for agent in ("goal", "defense", "offense", "attack"):
    observation_space = Box(
        np.array([0.0,   0.0,  -np.inf,  -np.inf,   0.0,  -32.0,  0.0]),
        np.array([1210,  700,   np.inf,   np.inf,   1.0,   32.0,  1.0]),
        dtype=np.float32
    )

    # Own side
    # Position, Rotation, Pos speed, Rot speed
    action_space = Box(
        np.array([-1.0, -1.0]),
        np.array([ 1.0,  1.0]),
        dtype=np.float32
    )

    policies[agent] = PolicySpec(
        observation_space=observation_space,
        action_space=action_space,
    )

config = (
    PPOConfig()
    .resources(num_gpus=1.0)
    .environment("fuzbai")
    .rollouts(observation_filter="MeanStdFilter", num_rollout_workers=8, num_envs_per_worker=8)
    .callbacks(SelfPlayCallBack)
    .multi_agent(policies=policies, policy_mapping_fn=lambda agent_id, *args, **kwargs: agent_id)
    .training(
        lr=1e-4,
        train_batch_size=512,
    )
)

tuner = Tuner(
    "PPO",
    param_space=config,
    run_config=RunConfig(
        stop={
            "episode_reward_mean": 50,
        },
        checkpoint_config=CheckpointConfig(5, "episode_reward_mean", checkpoint_at_end=True, checkpoint_frequency=50),
    ),
)

results = tuner.fit()
best_result = results.get_best_result("episode_reward_mean", "max")
print(pretty_print(best_result.metrics))
print(f"Checkpoint: {best_result.checkpoint.path}")

Issue Severity

High: It blocks me from completing my task.

Jeffjewett27 commented 8 months ago

I also face this issue. I am just doing single agent PPO. High severity: this blocks me.

from ray import air, train, tune
from ray.air.integrations.wandb import WandbLoggerCallback
from ray.tune.registry import get_trainable_cls

# cli_args and get_env_config_from_cli are defined elsewhere in the script (not shown).
config = (
    get_trainable_cls('PPO')
    .get_default_config()
    .environment(
        cli_args.env,
        env_config=get_env_config_from_cli(cli_args)
    )
    .training(
        grad_clip=1
    )
    .framework('torch')
    .rollouts(
        num_rollout_workers=cli_args.workers,
    )
)

# Stop as soon as any one of these criteria is met.
stop = {
    "training_iteration": cli_args.stop_iters,
    "timesteps_total": cli_args.stop_timesteps,
    "episode_reward_mean": cli_args.stop_reward,
}

tuner = tune.Tuner(
    'PPO',
    param_space=config.to_dict(),
    run_config=air.RunConfig(
        stop=stop,
        checkpoint_config=train.CheckpointConfig(checkpoint_frequency=4, num_to_keep=2),
        callbacks=[
            WandbLoggerCallback(
                project="project"
            )
        ]
    ),
)
results = tuner.fit()

justinvyu commented 4 months ago

This seems to be caused by os.replace playing badly with windows + temporary files. The fix seems to be using os.rename instead.

@Jeffjewett27 @davidhozic Would you be interested in opening a PR to fix this?

davidhozic commented 4 months ago

> This seems to be caused by os.replace playing badly with windows + temporary files. The fix seems to be using os.rename instead.
>
> @Jeffjewett27 @davidhozic Would you be interested in opening a PR to fix this?

Sure. I'll do it tomorrow, and I'll also run an extended-length test.

justinvyu commented 4 months ago

Thanks @davidhozic! FYI we don't have great windows test coverage for Train/Tune, but you can add a small test for this method here: https://github.com/ray-project/ray/blob/master/python/ray/train/tests/test_windows.py#L29

Is this issue flaky or consistently reproducible?

davidhozic commented 4 months ago

> Thanks @davidhozic! FYI we don't have great windows test coverage for Train/Tune, but you can add a small test for this method here: https://github.com/ray-project/ray/blob/master/python/ray/train/tests/test_windows.py#L29
>
> Is this issue flaky or consistently reproducible?

Well, it doesn't always happen, but it has certainly happened a few times. Recently I've just been using WSL as a workaround.

davidhozic commented 4 months ago

Quick update: I think Windows Defender may be causing the issue. I'm not sure why it never fails in the with open(...) part but does fail at os.replace. I'll run some more tests to be sure.
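
For reference, the two steps mentioned here (writing the temporary file inside with open(...), then swapping it onto the target with os.replace) look roughly like the sketch below. This is a standalone illustration, not Ray's actual _atomic_save: the name atomic_save_sketch, the default temp file name, and the use of pickle are all assumptions made for the example.

```python
import os
import pickle
import tempfile


def atomic_save_sketch(state, checkpoint_dir, file_name, tmp_file_name=".tmp_state"):
    """Write state to a temp file, then swap it over the target in one step.

    Hypothetical helper for illustration only; not Ray's implementation.
    """
    tmp_path = os.path.join(checkpoint_dir, tmp_file_name)
    target_path = os.path.join(checkpoint_dir, file_name)

    # Step 1: the `with open(...)` part. Only the temporary file is touched.
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)

    # Step 2: swap the temp file over the (possibly existing) target.
    # This is the call that raises PermissionError in the traceback above.
    os.replace(tmp_path, target_path)
    return target_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        atomic_save_sketch({"step": 1}, d, "state.pkl")
        path = atomic_save_sketch({"step": 2}, d, "state.pkl")  # overwrites the first save
        with open(path, "rb") as f:
            print(pickle.load(f))
```

The sketch uses os.replace because it overwrites an existing destination on every platform. On Windows, os.rename raises FileExistsError when the destination already exists, so a switch to os.rename would have to deal with the previous state file first.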

davidhozic commented 4 months ago

@justinvyu Yeah, it's definitely the antivirus. With it disabled, I no longer see the issue. I don't really see a fix for this, other than a try-except retry loop that keeps trying until the replace succeeds. That's not exactly the best solution, but we can't influence Windows Defender. It also seems to happen for me only when my screen is locked; disabling Windows Defender or leaving the screen unlocked works fine.

Is the try-except retry loop an acceptable solution?
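
A bounded version of that retry idea could look something like the sketch below. It is only an illustration of the approach, not a patch against Ray: the helper name replace_with_retry, the attempts and delay defaults, and the injectable replace argument (there so the loop can be exercised without a real antivirus lock) are all assumptions.

```python
import os
import time


def replace_with_retry(src, dst, attempts=5, delay=0.1, replace=os.replace):
    """Call replace(src, dst), retrying on PermissionError.

    Hypothetical helper: retries a fixed number of times, sleeping a bit
    longer after each failure, and re-raises the last error if all fail.
    Returns the number of attempts it took.
    """
    for attempt in range(1, attempts + 1):
        try:
            replace(src, dst)
            return attempt
        except PermissionError:
            if attempt == attempts:
                raise  # give up and surface the original error
            # Back off linearly before the next try.
            time.sleep(delay * attempt)
```

Capping the number of attempts avoids an unbounded while loop: if the file stays locked, the original PermissionError still surfaces instead of the controller hanging forever.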

anyscalesam commented 1 month ago

cc @justinvyu
