knshnb opened 3 years ago
Thanks for your PR! It is really good to have better resumability.
First, let me summarize what I think needs to be done to achieve resumability. Please comment if I missed anything. I checked the points that are already supported by this PR.
- [x] Agent's state (`Agent.save`)
- [x] Replay buffer (`Agent.save`)
- [ ] Experiment record (`scores.txt`)
- [ ] Random states of `torch`, `random`, and `numpy`* (see the sketch below)

`*` indicates things needed only when you resume a half-way training episode.
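For the random-state item, here is a minimal sketch of what capturing and restoring the three RNG states could look like. The helper names and the `rng_states.pkl` file are hypothetical illustrations, not part of this PR:

```python
import os
import pickle
import random

import numpy as np
import torch


def save_rng_states(dirname):
    # Dump the states of Python's, NumPy's, and PyTorch's RNGs so a resumed
    # run can continue the same random sequences.
    states = {
        "random": random.getstate(),
        "numpy": np.random.get_state(),
        "torch": torch.get_rng_state(),
    }
    with open(os.path.join(dirname, "rng_states.pkl"), "wb") as f:
        pickle.dump(states, f)


def load_rng_states(dirname):
    # Restore the RNG states saved by save_rng_states.
    with open(os.path.join(dirname, "rng_states.pkl"), "rb") as f:
        states = pickle.load(f)
    random.setstate(states["random"])
    np.random.set_state(states["numpy"])
    torch.set_rng_state(states["torch"])
```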
This is a large list, and it would be a tough task to support all of it. I think it is ok to start supporting only part of it if `max_score` is saved separately, but I think it is better to load `scores.txt` to restore all the evaluation information.

Saving a snapshot every `checkpoint_freq` steps is not always desirable, since it would consume time and storage mostly due to the replay buffer. It should be optional.

Could you run `python examples/atari/reproduction/train_dqn.py --env SpaceInvadersNoFrameskip-v4 --steps 10000000` (which takes <1 day with a single GPU, a single CPU, and 14GB of CPU RAM) with snapshots saved? Run with five random seeds (`--seed 0/1/2/3/4`), since variance among runs is high.

Thank you for the detailed comments!! Below is a memo of the discussion with @muupan san.
- Save a resumable snapshot in `save_agent` only when `take_resumable_snapshot` is True.
- Save `steps` and `episodes` in a file (such as `checkpoint.txt`); a rough sketch of this and the `max_score` restore follows this list.
- Restore `max_score` from `scores.txt`.
- Save `scores.txt` in the snapshot for the case `eval_interval != checkpoint_freq`.
- Cover the snapshot feature in `pfrl/examples_tests/atari/reproduction/test_dqn.sh`.
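For illustration only, a rough sketch of the `checkpoint.txt` and `max_score` items above. The function names, the whitespace-separated `checkpoint.txt` format, and the `scores.txt` column name are assumptions, not the PR's actual code:

```python
import os


def save_counters(dirname, steps, episodes):
    # Hypothetical helper: record the global step/episode counters so that
    # training can resume from the right point (file name from the memo above).
    with open(os.path.join(dirname, "checkpoint.txt"), "w") as f:
        f.write("{} {}\n".format(steps, episodes))


def load_counters(dirname):
    # Read the counters back when resuming.
    with open(os.path.join(dirname, "checkpoint.txt")) as f:
        steps, episodes = (int(x) for x in f.read().split())
    return steps, episodes


def restore_max_score(scores_path):
    # Recover the best evaluation score from scores.txt instead of storing it
    # separately; assumes a whitespace-separated file whose header row
    # contains a "mean" column, as in pfrl's experiment output.
    with open(scores_path) as f:
        header = f.readline().split()
        mean_idx = header.index("mean")
        means = [float(line.split()[mean_idx]) for line in f if line.strip()]
    return max(means) if means else float("-inf")
```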
I conducted the experiment that you suggested with the following command.
`python examples/atari/reproduction/dqn/train_dqn.py --env SpaceInvadersNoFrameskip-v4 --steps 10000000 --checkpoint-freq 2000000 --save-snapshot --load-snapshot --seed ${SEED} --exp-id ${SEED}`
For each seed, I ran another training run resuming from the 6,000,000-step snapshot. As shown in the graph below, the score transitions after resuming from the snapshots were roughly the same as those without resumption.
In this experiment, each snapshot was about 6.8GB and took around 60-100 seconds to save to an NFS server in my environment. You can check how many seconds it took to save each snapshot in `snapshot_history.txt`.
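As a hedged sketch of how such per-snapshot save times could be recorded: only `Agent.save` and the `snapshot_history.txt` file name come from this thread; the helper name and directory layout are illustrative, not the PR's actual implementation.

```python
import os
import time


def save_snapshot_with_timing(agent, outdir, step):
    # Hypothetical helper: time the (potentially large) snapshot save and
    # append "<step> <seconds>" to snapshot_history.txt so slow saves,
    # e.g. on NFS, are easy to spot.
    snapshot_dir = os.path.join(outdir, "{}_checkpoint".format(step))
    start = time.time()
    agent.save(snapshot_dir)  # Agent.save(dirname) as referenced above
    elapsed = time.time() - start
    with open(os.path.join(outdir, "snapshot_history.txt"), "a") as f:
        f.write("{} {:.1f}\n".format(step, elapsed))
```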
/test
Successfully created a job for commit dde7ebf:
Sorry, I fixed the linter problem.
(I forgot to write this) Memo: saving snapshots requires about twice as much CPU memory (~30GB in the above experiment).
Hi! Is there any action required for this PR to be merged?
pfrl currently does not support taking snapshots of training, which is important in many job systems such as Kubernetes. This PR supports saving and loading snapshots, including the replay buffer.
Done
`python examples/gym/train_dqn_gym.py --env CartPole-v0 --steps=5000 --eval-n-runs=10 --eval-interval=1000 --load_snapshot --checkpoint-freq=1000`
Not Done
Could you check the current implementation strategy and give some ideas on how to implement the above points?