r9y9 / wavenet_vocoder

WaveNet vocoder
https://r9y9.github.io/wavenet_vocoder/

Regarding resuming from checkpoint #132

Closed jseplae closed 5 years ago

jseplae commented 5 years ago

I've resorted to posting a question here (for advice, I guess), as I sometimes notice a considerable increase in training loss variance after resuming from a checkpoint:

[Image: resume-1 — training loss curves across the resume points]

This is a mixture/raw-input model using EMA decay and no LR decay, with a large dataset consisting of small audio chunks (standard PyTorch batch shuffling). Note that the first blue-red switch is the resume point at 100k steps (no problems there). Should I expect to have to restart multiple times to get a good resume, i.e., is this behavior expected? If anyone has experience with similar issues, I'd love to hear from you.

geneing commented 5 years ago

@jseplae Are you saving and restoring the optimizer state? My guess from your description is that the optimizer had switched to a smaller step size before you saved the state; when you restarted, the optimizer began again with the original (large?) step size. The solution is to save the optimizer state too.
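For reference, a minimal sketch of what that looks like in PyTorch (not the repository's exact code; the model and step counter below are stand-ins). The point is that adaptive optimizers like Adam keep per-parameter moment estimates in their state dict, and those estimates are what set the effective step size:

```python
import torch
import torch.nn as nn

# Stand-in model and step counter; the real script builds the WaveNet here.
model = nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
global_step = 100_000

# Saving: include the optimizer state, not just the weights. Adam's
# per-parameter moments (exp_avg, exp_avg_sq) live in its state dict.
torch.save({
    "step": global_step,
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}, "checkpoint.pth")

# Resuming: restore both, so the optimizer continues with the same
# effective step sizes instead of restarting from its initial state.
ckpt = torch.load("checkpoint.pth", map_location="cpu")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
global_step = ckpt["step"]
```

If EMA weights are tracked (as in this setup), the shadow copy is training state too and would need to be saved and restored the same way.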

jseplae commented 5 years ago


Hi, and thanks for the comment. The default is to save the optimizer state, which was done here, but something is obviously different about the second resumption (light blue). I have since gone back to normal non-chunked files, and noticed that for my data an initial LR of 1e-4 looks better without a schedule. The reason I'm not using the original Noam decay is that I have on the order of 10 times as much data as, for example, the CMU Arctic set. From initial tests, the 4k warmup seemed too aggressive, i.e., the schedule decayed to values that were too low too early.
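For context on the warmup point, a sketch of the Noam schedule (Vaswani et al., 2017), assuming the training script implements this standard form; the constants are illustrative. The LR rises linearly for `warmup_steps`, peaks exactly at `warmup_steps`, and then decays as `step ** -0.5`:

```python
def noam_lr(step, init_lr=1e-3, warmup_steps=4000):
    """Noam learning-rate schedule: linear warmup, then step**-0.5 decay.

    The peak LR (init_lr) is reached exactly at step == warmup_steps.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    scale = warmup_steps ** 0.5 * min(step * warmup_steps ** -1.5,
                                      step ** -0.5)
    return init_lr * scale

# With a 4k warmup the LR has already decayed to 20% of its peak by
# step 100k -- still early in training on a ~10x larger dataset:
for s in (1000, 4000, 20000, 100000):
    print(s, noam_lr(s))
```

Since the decay tail is `init_lr * (warmup_steps / step) ** 0.5`, raising `warmup_steps` both delays the peak and lifts the whole tail, which is one way to make the schedule less aggressive on a larger dataset.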

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.