Sorry for the confusion. I'll update the appendix (see this update).
The batch size of 100 is what I trained with, due to a GPU memory constraint. (The batch size of 200 was for evaluation, which is irrelevant to training.)
The `decay_factor=0.99997592083` is borrowed from this. The commented equation `math.exp(math.log(0.1)/opt.learning_rate_decay_every/opt.iterPerEpoch)` is from here, but it does not help in our case. What I have experienced is that the optimization is highly tricky; maybe 0.99997592083 is golden with RMSProp (or maybe not). I leave this for further investigation. (I should discuss it with @jiasenlu 🤔)
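For concreteness, here is a minimal Python sketch of what the commented equation computes: a per-iteration multiplier such that the learning rate falls to 10% of its initial value after `opt.learning_rate_decay_every` epochs. The `decay_every_epochs=100` below is not a value read from the repo; it is inferred because it reproduces the corrected figure quoted later in this thread.

```python
import math

def per_iter_decay_factor(decay_every_epochs, iters_per_epoch):
    # Per-iteration LR multiplier chosen so that
    # factor ** (decay_every_epochs * iters_per_epoch) == 0.1,
    # i.e. the LR drops by 10x every decay_every_epochs epochs.
    return math.exp(math.log(0.1) / decay_every_epochs / iters_per_epoch)

# Numbers from this thread: batch_size = 100, 240000 samples per epoch.
# decay_every_epochs = 100 is an inference, not taken from the repo.
iters_per_epoch = 240000 // 100  # = 2400
print(per_iter_decay_factor(100, iters_per_epoch))  # ~0.99999040594147
```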
The training option `kick_interval` is inspired by deepsense.io. For VQA, this option is borrowed from our previous work, after an empirical observation of a minor improvement. I am not sure whether the adaptive change of `kick_interval` is helpful or not.
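I can't vouch for the exact semantics of `kick_interval` in this codebase, but a minimal sketch of one plausible reading, where the otherwise-decaying learning rate is periodically "kicked" back to its initial value, looks like this. The `base_lr` value and the reset-to-initial behavior are illustrative assumptions:

```python
def kicked_lr(base_lr, decay_factor, iteration, kick_interval):
    # Exponentially decayed LR that is reset ("kicked") back to base_lr
    # every kick_interval iterations. Illustrative only; the repo's
    # actual kick_interval logic may differ.
    return base_lr * decay_factor ** (iteration % kick_interval)

# The LR decays within each kick window, then jumps back up.
for it in (0, 1, 99_999, 100_000):
    print(it, kicked_lr(4e-4, 0.99997592083, it, kick_interval=100_000))
```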
Thank you for answering!
@Dualmirror you're welcome!
`decay_factor` should be 0.99999040594147 (not 0.99997592083) if `opt.iterPerEpoch = 240000 / opt.batch_size` and `opt.batch_size = 100`; in the paper, the batch size is 200. In fact, `opt.iterPerEpoch` should be `334554 / opt.batch_size`, so the `kick_interval` must be changed too.
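To make this correction concrete, a quick check with the same formula as above (again assuming a 10x-per-100-epochs decay target, which is inferred rather than confirmed) shows how `decay_factor` moves with each choice of `opt.iterPerEpoch`:

```python
import math

def decay_factor(decay_every_epochs, iters_per_epoch):
    # Same formula as the commented equation earlier in the thread.
    return math.exp(math.log(0.1) / decay_every_epochs / iters_per_epoch)

print(decay_factor(100, 240000 / 100))  # 0.99999040594147, the corrected value
print(decay_factor(100, 334554 / 100))  # ~0.9999931, with the larger sample count
```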