knshnb opened 3 years ago
Thanks for your PR! It is really good to have better resumability.
First, let me summarize what I think needs to be done to achieve resumability. Please comment if I missed anything. I checked the points that are already supported by this PR.
- [x] Agent's state (`Agent.save`)
- [x] Replay buffer (`Agent.save`)
- [ ] Experiment record (`scores.txt`)
- [ ] Random states of `torch`, `random`, and `numpy`* (see the sketch below)

`*` indicates things needed only when you resume a half-way training episode.
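For the random-state item, here is a minimal sketch of what capturing and restoring the three RNG states could look like. The helper names and the `rng_states.pkl` file are hypothetical illustrations, not part of this PR:

```python
import os
import pickle
import random

import numpy as np
import torch


def save_rng_states(dirname):
    # Dump the states of Python's, NumPy's, and PyTorch's RNGs so a resumed
    # run can continue the same random sequences.
    states = {
        "random": random.getstate(),
        "numpy": np.random.get_state(),
        "torch": torch.get_rng_state(),
    }
    with open(os.path.join(dirname, "rng_states.pkl"), "wb") as f:
        pickle.dump(states, f)


def load_rng_states(dirname):
    # Restore the RNG states saved by save_rng_states.
    with open(os.path.join(dirname, "rng_states.pkl"), "rb") as f:
        states = pickle.load(f)
    random.setstate(states["random"])
    np.random.set_state(states["numpy"])
    torch.set_rng_state(states["torch"])
```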
This is a large list, and it would be a tough task to support all of it. I think it is ok to start supporting only part of it if `max_score` is saved separately, but I think it is better to load `scores.txt` to restore all the evaluation information.

Saving a snapshot every `checkpoint_freq` steps is not always desirable, since it would consume time and storage mostly due to the replay buffer. It should be optional.

Could you run `python examples/atari/reproduction/train_dqn.py --env SpaceInvadersNoFrameskip-v4 --steps 10000000` (which takes <1 day with a single GPU, a single CPU, and 14GB of CPU RAM) with snapshots saved? Run with five random seeds (`--seed 0/1/2/3/4`), since variance among runs is high.

Thank you for the detailed comments!! Below is a memo of the discussion with @muupan san.
- Save a resumable snapshot in `save_agent` only when `take_resumable_snapshot` is True.
- Save `steps` and `episodes` in a file (such as `checkpoint.txt`); a rough sketch of this and the `max_score` restore follows this list.
- Restore `max_score` from `scores.txt`.
- Save `scores.txt` in the snapshot for the case `eval_interval != checkpoint_freq`.
- Cover the snapshot feature in `pfrl/examples_tests/atari/reproduction/test_dqn.sh`.
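For illustration only, a rough sketch of the `checkpoint.txt` and `max_score` items above. The function names, the whitespace-separated `checkpoint.txt` format, and the `scores.txt` column name are assumptions, not the PR's actual code:

```python
import os


def save_counters(dirname, steps, episodes):
    # Hypothetical helper: record the global step/episode counters so that
    # training can resume from the right point (file name from the memo above).
    with open(os.path.join(dirname, "checkpoint.txt"), "w") as f:
        f.write("{} {}\n".format(steps, episodes))


def load_counters(dirname):
    # Read the counters back when resuming.
    with open(os.path.join(dirname, "checkpoint.txt")) as f:
        steps, episodes = (int(x) for x in f.read().split())
    return steps, episodes


def restore_max_score(scores_path):
    # Recover the best evaluation score from scores.txt instead of storing it
    # separately; assumes a whitespace-separated file whose header row
    # contains a "mean" column, as in pfrl's experiment output.
    with open(scores_path) as f:
        header = f.readline().split()
        mean_idx = header.index("mean")
        means = [float(line.split()[mean_idx]) for line in f if line.strip()]
    return max(means) if means else float("-inf")
```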
I conducted the experiment that you suggested with the following command.
`python examples/atari/reproduction/dqn/train_dqn.py --env SpaceInvadersNoFrameskip-v4 --steps 10000000 --checkpoint-freq 2000000 --save-snapshot --load-snapshot --seed ${SEED} --exp-id ${SEED}`
For each seed, I ran another training run resuming from the 6,000,000-step snapshot. As shown in the graph below, the score transitions after resuming from the snapshots were roughly the same as those without resumption.
In this experiment, each snapshot was about 6.8GB and took around 60-100 seconds to save to an NFS server in my environment. You can check how many seconds it took to save each snapshot in `snapshot_history.txt`.
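As a hedged sketch of how such per-snapshot save times could be recorded: only `Agent.save` and the `snapshot_history.txt` file name come from this thread; the helper name and directory layout are illustrative, not the PR's actual implementation.

```python
import os
import time


def save_snapshot_with_timing(agent, outdir, step):
    # Hypothetical helper: time the (potentially large) snapshot save and
    # append "<step> <seconds>" to snapshot_history.txt so slow saves,
    # e.g. on NFS, are easy to spot.
    snapshot_dir = os.path.join(outdir, "{}_checkpoint".format(step))
    start = time.time()
    agent.save(snapshot_dir)  # Agent.save(dirname) as referenced above
    elapsed = time.time() - start
    with open(os.path.join(outdir, "snapshot_history.txt"), "a") as f:
        f.write("{} {:.1f}\n".format(step, elapsed))
```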
/test
Successfully created a job for commit dde7ebf:
Sorry, I fixed the linter problem.
(I forgot to write this) Memo: saving snapshots requires about twice as much CPU memory (~30GB in the above experiment).
Hi! Is there any action required for this PR to be merged?
pfrl currently does not support taking snapshots of training, which is important in many job systems such as Kubernetes. This PR supports saving and loading snapshots, including the replay buffer.
Done
`python examples/gym/train_dqn_gym.py --env CartPole-v0 --steps=5000 --eval-n-runs=10 --eval-interval=1000 --load_snapshot --checkpoint-freq=1000`
Not Done
Could you check the current implementation strategy and give some ideas on how to implement the above points?