Open davidhozic opened 8 months ago
I also face this issue. I am just doing single agent PPO. High severity: this blocks me.
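from ray import air, train, tune
from ray.air.integrations.wandb import WandbLoggerCallback
from ray.tune.registry import get_trainable_cls

# cli_args and get_env_config_from_cli are defined elsewhere in the script (not shown).
config = (
    get_trainable_cls('PPO')
    .get_default_config()
    .environment(
        cli_args.env,
        env_config=get_env_config_from_cli(cli_args)
    )
    .training(
        grad_clip=1
    )
    .framework('torch')
    .rollouts(
        num_rollout_workers=cli_args.workers,
    )
)

# Stop as soon as any one of these criteria is met.
stop = {
    "training_iteration": cli_args.stop_iters,
    "timesteps_total": cli_args.stop_timesteps,
    "episode_reward_mean": cli_args.stop_reward,
}

tuner = tune.Tuner(
    'PPO',
    param_space=config.to_dict(),
    run_config=air.RunConfig(
        stop=stop,
        # Checkpoint every 4 iterations and keep only the 2 most recent.
        checkpoint_config=train.CheckpointConfig(checkpoint_frequency=4, num_to_keep=2),
        callbacks=[
            WandbLoggerCallback(
                project="project"
            )
        ]
    ),
)
results = tuner.fit()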
This seems to be caused by os.replace playing badly with Windows + temporary files. The fix seems to be using os.rename instead.
@Jeffjewett27 @davidhozic Would you be interested in opening a PR to fix this?
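For context, here is a minimal sketch of the write-to-a-temp-file-then-swap pattern under discussion. It is not Ray's actual code: the function name `atomic_write_json` and the JSON payload are hypothetical, chosen only for illustration. One thing to keep in mind when weighing the two calls: per the Python docs, os.replace overwrites an existing destination on every platform, while os.rename raises FileExistsError on Windows if the destination already exists.

```python
import json
import os
import tempfile


def atomic_write_json(path, payload):
    """Write payload to path by writing a temp file and swapping it into place.

    Hypothetical helper for illustration; not taken from Ray.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the swap stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
        # os.replace overwrites an existing destination on all platforms.
        # os.rename would raise FileExistsError on Windows in that case.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temp file behind if the write or swap failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Calling `atomic_write_json("state.json", {"step": 1})` repeatedly on the same path is the situation where the swap has to overwrite an existing file.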
Sure. I'll do it tomorrow with addition of running an extended-length test.
Thanks @davidhozic! FYI we don't have great Windows test coverage for Train/Tune, but you can add a small test for this method here: https://github.com/ray-project/ray/blob/master/python/ray/train/tests/test_windows.py#L29
Is this issue flaky or consistently reproducible?
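Since the failure appears intermittent, a test for it probably needs to repeat the swap many times rather than once. The following is only a rough sketch of what such a test could look like: `count_replace_failures` and `test_repeated_replace` are hypothetical names, and the loop exercises plain os.replace rather than the actual Ray method, which is not shown in this thread. `tmp_path` is pytest's built-in temporary-directory fixture.

```python
import os
import tempfile


def count_replace_failures(directory, iterations=200):
    """Repeatedly swap a fresh temp file over one target file.

    Returns how many of the swaps raised OSError. Hypothetical helper.
    """
    target = os.path.join(directory, "state.json")
    failures = 0
    for i in range(iterations):
        fd, tmp_file = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            f.write(str(i))
        try:
            os.replace(tmp_file, target)
        except OSError:
            failures += 1
            os.remove(tmp_file)  # clean up the temp file that was not swapped in
    return failures


def test_repeated_replace(tmp_path):
    # On a machine where nothing else touches the directory, no swap should fail.
    assert count_replace_failures(str(tmp_path), iterations=200) == 0
```

A loop like this would likely only reproduce the problem on a Windows machine where something else is also touching the files, so a passing run elsewhere does not say much.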
Well it doesn't always seem to happen, but it has certainly happened a few times. Recently I've just been using WSL as a workaround.
Quick update.
I think it may be Windows Defender that's causing the issue. Not sure why it never fails with the with open(...) part, but fails with os.replace. I'll do some more tests to be sure.
@justinvyu Yeah, it's definitely the anti-virus. With it disabled, I don't get any more failures. I don't really see a fix for this either, except a try-except inside a while loop that keeps retrying until it succeeds... not exactly the best solution, but we can't really influence Windows Defender. It also seems to happen for me only when my screen is locked. Disabling Windows Defender or leaving the screen unlocked seems to work fine.
Is the try-except retry loop an acceptable solution?
cc @justinvyu
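A minimal sketch of the retry idea, with a bounded number of attempts instead of an unbounded while loop so a permanent failure still surfaces. The helper name `replace_with_retry`, its parameters, and the default values are hypothetical. It also assumes the failure shows up as PermissionError, which is how Windows access-denied and sharing-violation errors usually surface in Python; the traceback is not included in this thread, so that part is a guess.

```python
import os
import time


def replace_with_retry(src, dst, attempts=5, delay_s=0.1, replace=os.replace):
    """Call replace(src, dst), retrying on PermissionError with a growing delay.

    Hypothetical helper. The `replace` parameter exists only so the swap
    can be stubbed out in tests.
    """
    for attempt in range(attempts):
        try:
            replace(src, dst)
            return
        except PermissionError:
            # Assumption: the scanner holds the file only briefly, so waiting helps.
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the original error
            time.sleep(delay_s * (2 ** attempt))
```

With the defaults, this waits 0.1 s, 0.2 s, 0.4 s and 0.8 s between the five attempts, then re-raises. Whether a total wait of about 1.5 s is enough for a locked-screen scan is something that would need to be measured.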
What happened + What you expected to happen
While training multiple agents on an environment, the tuner crashes after a certain amount of time because it is unable to replace the basic-variant-state file under ray_results. I'm using PPO on a custom environment and training with tune.Tuner. I tried running this multiple times and it always seems to crash, sometimes quickly, sometimes after a few hours. It's quite annoying, as I cannot leave my computer to train on this complex environment unattended.
Traceback:
Versions / Dependencies
Python version: Python 3.10.11 (virtual environment)
OS: Windows 11 x64
Dependencies:
Reproduction script
Issue Severity
High: It blocks me from completing my task.