pfnet / pfrl

PFRL: a PyTorch-based deep reinforcement learning library
MIT License

Snapshot for preemption #155

Open knshnb opened 3 years ago

knshnb commented 3 years ago

pfrl currently does not support snapshotting a training run, which is important in many job systems that preempt jobs, such as Kubernetes. This PR adds support for saving and loading snapshots, including the replay buffer.
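For readers unfamiliar with the idea, here is a minimal sketch of snapshot-based resumption. It is not the implementation in this PR: `save_snapshot`, `load_snapshot`, `train`, and the dict layout are hypothetical names chosen for illustration, and a plain list stands in for the replay buffer. The snapshot is written to a temporary file and then renamed, on the assumption that a job may be preempted in the middle of a save.

```python
import os
import pickle
import tempfile


def save_snapshot(state, path):
    """Write `state` to `path` via a temp file and an atomic rename."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic on the same filesystem, so a preempted job
        # leaves either the old snapshot or the new one, never a partial file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_snapshot(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def train(total_steps, snapshot_path, checkpoint_freq):
    """Toy training loop that resumes from `snapshot_path` if it exists."""
    if os.path.exists(snapshot_path):
        state = load_snapshot(snapshot_path)
    else:
        state = {"step": 0, "replay_buffer": []}
    while state["step"] < total_steps:
        # Stand-in for storing one environment transition.
        state["replay_buffer"].append(state["step"])
        state["step"] += 1
        if state["step"] % checkpoint_freq == 0:
            save_snapshot(state, snapshot_path)
    return state
```

Calling `train` a second time with the same `snapshot_path` continues from the last saved step instead of starting from step 0; work done after the last snapshot is repeated.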

Done

Not Done

Could you check the current implementation strategy and give some ideas on how to implement the above points?

muupan commented 3 years ago

Thanks for your PR! It is really good to have better resumability.

General comments on resumability

First, let me summarize what I think needs to be done to achieve resumability. Please comment if I have missed something. I have checked off the points that this PR supports.

Things that need to be snapshotted for resumability except randomness:

RNG-related things that need to be snapshotted for complete resumability:
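As a small illustration of what snapshotting RNG state means, the sketch below captures and restores the state of Python's built-in `random` module only. The helper names `capture_rng_state` and `restore_rng_state` are hypothetical and not part of pfrl. NumPy and PyTorch expose analogous calls (`np.random.get_state` / `np.random.set_state`, `torch.get_rng_state` / `torch.set_rng_state`), which a fuller version would presumably need to cover as well.

```python
import random


def capture_rng_state():
    # Only the stdlib generator is captured in this sketch; other libraries
    # keep their own independent generator state.
    return {"python": random.getstate()}


def restore_rng_state(state):
    random.setstate(state["python"])


random.seed(0)
random.random()  # advance the generator, as training would

saved = capture_rng_state()
expected = [random.random() for _ in range(3)]

# Restoring the saved state makes the generator replay the same sequence.
restore_rng_state(saved)
replayed = [random.random() for _ in range(3)]
```

After the restore, `replayed` equals `expected`, which is the property a resumed run needs in order to be bitwise comparable with an uninterrupted one.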

This is a large list, and it would be a tough task to support all of it. I think it is ok to start supporting only part of it if

Specific comments on this PR

knshnb commented 3 years ago

Thank you for the detailed comments!! Below is a memo of my discussion with @muupan san.

What I skip in this PR

What I implement

knshnb commented 3 years ago

I conducted the experiment that you suggested with the following command:

python examples/atari/reproduction/dqn/train_dqn.py --env SpaceInvadersNoFrameskip-v4 --steps 10000000 --checkpoint-freq 2000000 --save-snapshot --load-snapshot --seed ${SEED} --exp-id ${SEED}

For each seed, I ran another training run that resumed from the 6,000,000-step snapshot. As shown in the graph below, the score curves after resuming from the snapshots were roughly the same as those of the runs without resumption.

image

In this experiment, each snapshot was about 6.8GB and took around 60-100 seconds to save to an NFS server in my environment. You can check how many seconds each snapshot took to save in snapshot_history.txt.
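A rough sketch of how save time could be measured and logged is shown below. The actual format of snapshot_history.txt is not described in this thread, so the tab-separated layout (step, seconds, bytes) and the function name `save_with_timing` are assumptions made purely for illustration.

```python
import os
import pickle
import tempfile
import time


def save_with_timing(obj, snapshot_path, history_path, step):
    """Pickle `obj`, then append one line of timing info to `history_path`."""
    start = time.perf_counter()
    with open(snapshot_path, "wb") as f:
        pickle.dump(obj, f)
    elapsed = time.perf_counter() - start
    size = os.path.getsize(snapshot_path)
    # Assumed layout: step, seconds taken, snapshot size in bytes.
    with open(history_path, "a") as f:
        f.write(f"{step}\t{elapsed:.3f}\t{size}\n")
    return elapsed


# Usage with a small stand-in object instead of a real replay buffer.
with tempfile.TemporaryDirectory() as tmpdir:
    history = os.path.join(tmpdir, "snapshot_history.txt")
    for step in (100, 200):
        snapshot = os.path.join(tmpdir, f"{step}_snapshot.pkl")
        save_with_timing(list(range(1000)), snapshot, history, step)
    with open(history) as f:
        lines = f.read().splitlines()
```

Logging the size next to the elapsed time makes it easier to tell whether a slow save comes from a larger snapshot or from the storage backend.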

muupan commented 3 years ago

/test

pfn-ci-bot commented 3 years ago

Successfully created a job for commit dde7ebf:

knshnb commented 3 years ago

Sorry about that. I have fixed the linter problem.

knshnb commented 3 years ago

(I forgot to write this) Memo: Saving snapshots requires about twice as much CPU memory (~30GB in the above experiment).

knshnb commented 2 years ago

Hi! Is there any action required for this PR to be merged?