Open · fritzo opened this issue 4 years ago
I believe this is important for my experiments so I want to fix this soon.
@fehiepsi I think the easiest way to get something that works would be to add an option to `ClippedAdam` like `ignore_zero_gradient_stats=True/False`. You'd then need to keep track of `state['step']` on a per-coordinate basis (i.e. the optimizer would need more memory), and then you'd just need to change the updates to do vectorized/masked updates of the statistics.
@martinjankowiak can you explain why it is necessary to keep track of coordinate-wise `state['step']`? I would think that `step` could be approximated as global, since the Poisson approximation concentrates (in contrast to the other statistics).
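To spell out that argument (the notation is mine; assume uniform subsamples of size B out of N data points and T global steps): each coordinate's update count k is roughly Poisson with mean TB/N, and its relative spread vanishes as training proceeds,

$$ k \;\approx\; \mathrm{Poisson}\!\left(\frac{TB}{N}\right), \qquad \frac{\mathrm{SD}[k]}{\mathbb{E}[k]} \;=\; \frac{1}{\sqrt{TB/N}} \;\longrightarrow\; 0 \quad \text{as } T \to \infty, $$

so a single global step count rescaled by B/N becomes an increasingly good surrogate for a per-coordinate `state['step']`, whereas the moment statistics themselves do not average out in the same way.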
Well, you could make that approximation, but I was assuming a pretty generic optimizer that makes few assumptions (just don't update when grad is zero).
I agree the coordinate-wise `step` is parsimonious. Another parsimonious assumption could be that the gradient distribution is a zero-inflated Normal, so that with slight modifications the usual Adam statistics can learn that distribution's parameters. Both versions seem reasonable.
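One possible reading of the zero-inflated-Normal idea, as a sketch (the estimator below and its names are mine, not from the thread): model the per-coordinate gradient as g = 0 with probability 1 − p and g ~ Normal(μ, σ²) with probability p; then the ordinary unmasked Adam EMAs, plus one extra EMA of the nonzero indicator, recover (p, μ, σ²):

```python
import torch

def zero_inflated_normal_estimates(grad, state, beta=0.999):
    """Estimate (p, mu, sigma^2) of a zero-inflated Normal gradient model
    from ordinary exponential moving averages (bias correction omitted)."""
    for key in ("exp_avg", "exp_avg_sq", "nonzero_rate"):
        state.setdefault(key, torch.zeros_like(grad))

    mask = (grad != 0).to(grad.dtype)
    state["exp_avg"] = beta * state["exp_avg"] + (1 - beta) * grad              # ~ p * mu
    state["exp_avg_sq"] = beta * state["exp_avg_sq"] + (1 - beta) * grad ** 2   # ~ p * (mu^2 + sigma^2)
    state["nonzero_rate"] = beta * state["nonzero_rate"] + (1 - beta) * mask    # ~ p

    p_hat = state["nonzero_rate"].clamp(min=1e-8)
    mu_hat = state["exp_avg"] / p_hat
    var_hat = state["exp_avg_sq"] / p_hat - mu_hat ** 2
    return p_hat, mu_hat, var_hat
```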
@martinjankowiak suggested that, now that #1796 makes `pyro.param` aware of subsampling, we could in principle make PyTorch optimizers whose gradient statistics are updated only for those elements that appear in a subsample. This would give lower-variance gradient estimates and would be cheaper.

@martinjankowiak also points out that an alternative is to amortize the guide.
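A hedged sketch of that index-based variant, assuming the optimizer can be handed the subsample indices for a given parameter (how those indices get from `pyro.param` to the optimizer is left open here):

```python
import torch

def subsample_adam_step(param, grad_rows, idx, state,
                        lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Adam-style update that touches only the rows selected by a subsample.

    param     : full parameter, shape (N, D)
    grad_rows : gradient for the subsampled rows only, shape (B, D)
    idx       : LongTensor of (unique) subsample indices, shape (B,)
    """
    beta1, beta2 = betas
    if not state:
        state["step"] = torch.zeros_like(param)
        state["exp_avg"] = torch.zeros_like(param)
        state["exp_avg_sq"] = torch.zeros_like(param)

    # Only the subsampled rows of the statistics (and the parameter) are read or written.
    state["step"][idx] += 1
    state["exp_avg"][idx] = beta1 * state["exp_avg"][idx] + (1 - beta1) * grad_rows
    state["exp_avg_sq"][idx] = beta2 * state["exp_avg_sq"][idx] + (1 - beta2) * grad_rows ** 2

    step = state["step"][idx]
    denom = (state["exp_avg_sq"][idx] / (1 - beta2 ** step)).sqrt() + eps
    param.data[idx] -= lr * (state["exp_avg"][idx] / (1 - beta1 ** step)) / denom
```

This is similar in spirit to `torch.optim.SparseAdam`, which likewise updates moment estimates only for the entries present in a sparse gradient.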