Open Bam4d opened 2 years ago
Is the main purpose of it to run more random seeds? This should no longer be an issue with the new slurm integration in the benchmark utility https://docs.cleanrl.dev/get-started/benchmark-utility/#slurm-integration. It basically increments the seed per run :)
env_ids={{env_ids}}
seeds={{seeds}}
env_id=${env_ids[$SLURM_ARRAY_TASK_ID / {{len_seeds}}]}
seed=${seeds[$SLURM_ARRAY_TASK_ID % {{len_seeds}}]}
echo "Running task $SLURM_ARRAY_TASK_ID with env_id: $env_id and seed: $seed"
srun {{command}} --env-id $env_id --seed $seed
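The arithmetic in the template above maps each SLURM array task ID to one (env_id, seed) pair. A minimal Python sketch of the same indexing (the `env_ids` and `seeds` values below are hypothetical, not the actual benchmark settings):

```python
# Hypothetical example values; the real lists come from the benchmark config.
env_ids = ["CartPole-v1", "Acrobot-v1"]
seeds = [1, 2, 3]

def task_to_run(task_id):
    # Mirrors the bash arithmetic: integer division picks the env,
    # the remainder picks the seed.
    env_id = env_ids[task_id // len(seeds)]
    seed = seeds[task_id % len(seeds)]
    return env_id, seed

# Tasks 0..5 cover every (env_id, seed) combination exactly once.
for t in range(len(env_ids) * len(seeds)):
    print(t, task_to_run(t))
```

Each array task therefore runs a distinct combination, so seeds are incremented per run without any two jobs sharing a configuration.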
Problem Description
In many CleanRL scripts, a timestamp is used as a differentiator in the naming of the jobs:
https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo.py#L134
In some rare cases (for example, when running on a Slurm cluster or Sun Grid Engine), two scripts may be executed within a second of each other and end up with the same timestamp.
If a shared drive is used to store results (very common for these cluster setups), the jobs can overwrite each other's data. You will end up with a bunch of weirdly similar-looking runs in wandb.
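The collision can be sketched as follows. The run name is built from the env id, experiment name, seed, and an integer timestamp (the format below mirrors the style of the linked line in ppo.py; treat the exact fields as illustrative):

```python
import time

def make_run_name(env_id, exp_name, seed, timestamp):
    # Illustrative run-name format in the style of CleanRL's scripts.
    return f"{env_id}__{exp_name}__{seed}__{timestamp}"

# Two jobs launched within the same second see the same int(time.time())
# value, so their run names are identical and their output paths collide.
ts = int(time.time())
a = make_run_name("CartPole-v1", "ppo", 1, ts)
b = make_run_name("CartPole-v1", "ppo", 1, ts)
```

Because the timestamp has one-second granularity, nothing distinguishes the two runs, and whichever job writes last wins.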
Checklist
poetry install
(see CleanRL's installation guideline).
Current Behavior
Data is overwritten due to shared drives and a naming convention that causes collisions.
wandb/tensorboard might throw an error, but I've only ever seen this happen once.
Expected Behavior
Data should not be overwritten and runs should always have unique names.
Possible Solution
I have replaced time.time() with uuid.uuid4(), which is extremely unlikely to cause collisions.
Steps to Reproduce
Steps to reproduce are a bit pointless unless you have access to a fairly empty cluster; however, I believe the bug is trivial enough to understand without repro steps.
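The replacement described under Possible Solution can be sketched like this (the run-name format again mirrors the style of the linked ppo.py line; the exact fields are illustrative):

```python
import uuid

def make_run_name(env_id, exp_name, seed):
    # uuid4() is 122 bits of randomness, so two concurrently launched
    # jobs will not share a run name even if started in the same second.
    return f"{env_id}__{exp_name}__{seed}__{uuid.uuid4()}"

# Same arguments, same instant: the names still differ.
a = make_run_name("CartPole-v1", "ppo", 1)
b = make_run_name("CartPole-v1", "ppo", 1)
```

Unlike a one-second-granularity timestamp, the UUID suffix makes each run's output directory unique regardless of how many jobs the scheduler starts simultaneously.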