pints-team / pints

Probabilistic Inference on Noisy Time Series
http://pints.readthedocs.io

Add gradient-free HMC with efficient kernel exponential families #101

Closed · ben18785 closed this 3 weeks ago

ben18785 commented 6 years ago

This recent paper has attracted quite a lot of attention and implements a non-approximate form of HMC that does not require sensitivities.

On a brief look through the paper, it seems that adding this will be a fairly complex process but, I think, a worthwhile one. How would people feel about me reaching out to the paper's authors on behalf of Pints? Particularly the guy in Oxford stats...
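For reference, the paper's method (kernel Hamiltonian Monte Carlo) works roughly as follows: fit a kernel exponential family surrogate to the chain's past samples by score matching, use the surrogate's gradient to drive the leapfrog dynamics, and keep the Metropolis accept/reject step on the true (unnormalised) target, so no model sensitivities are ever required. A minimal structural sketch of one transition, where `surrogate_grad` is a hypothetical stand-in for the fitted surrogate's gradient:

```python
import numpy as np

def kmc_step(x0, log_target, surrogate_grad, eps=0.1, n_steps=20, rng=None):
    # One transition in the spirit of kernel HMC: leapfrog driven by a
    # surrogate gradient, accept/reject evaluated on the true target only.
    rng = np.random.default_rng() if rng is None else rng
    p0 = rng.standard_normal(x0.shape)           # momentum ~ N(0, I)
    x, p = x0.copy(), p0.copy()
    p = p + 0.5 * eps * surrogate_grad(x)        # initial half kick
    for i in range(n_steps):
        x = x + eps * p                          # drift
        scale = 1.0 if i < n_steps - 1 else 0.5  # final kick is a half kick
        p = p + scale * eps * surrogate_grad(x)
    # Exact Metropolis-Hastings ratio: true log-target, no surrogate here.
    log_alpha = (log_target(x) - 0.5 * p @ p) - (log_target(x0) - 0.5 * p0 @ p0)
    return x if np.log(rng.uniform()) < log_alpha else x0

# Toy usage: standard-normal target, where the surrogate happens to be exact.
log_target = lambda x: -0.5 * x @ x
x = np.ones(2)
for _ in range(1000):
    x = kmc_step(x, log_target, surrogate_grad=lambda x: -x)
```

The key design point is that the surrogate only shapes the proposal; the accept step still evaluates the true log-target, so a poor surrogate costs efficiency, not correctness.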

MichaelClerx commented 6 years ago

Sounds great to me! @mirams? @sanmitraghosh?

sanmitraghosh commented 6 years ago

I have read this paper. It basically builds an emulator of the potential using the kernel trick. Something similar was actually proposed way back by Rasmussen in this paper: http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/pdfs/pdf2080.pdf.

I think this sort of method is useful when we can get the curvature of the likelihood, as in Riemannian HMC. But I am not sure that using gradients alone, as in HMC (whether from real sensitivities or a surrogate), will work. There is a lot of literature pointing to the fact that on an ODE model manifold a gradient step is not very useful. So I think we should first run HMC with real sensitivities on some of the interesting ODEs; only if we find that useful is there a point in diving into any gradient-based sampler. On the other hand, if we can easily extend this type of method to calculate the Hessian, then it's worth doing.

ben18785 commented 6 years ago

I'm not convinced that differential equation models are that unique. There's probably more opportunity for a multi-modal posterior, but otherwise I don't see the 'statistics' part of the problem (as opposed to the mathematical part, where the ODE or PDE is solved) as necessarily being that hard. And I think that knowledge of the gradient would surely be useful... it has proved to be the case with pretty much every other statistical problem.


sanmitraghosh commented 6 years ago

Well, rather than me telling you why the ODE problem is unique in a purely statistical sense (I can show you when we meet next), it's better you read this: https://arxiv.org/pdf/1501.07668.pdf.

By the way, I have played with all of the "any other statistical problems" and have found this (ODE) problem to be unique.

People back in the 60s realised that gradient-based (steepest descent) algorithms are inferior to Newton-type (Hessian-based) methods because of long valleys. That's why the Levenberg-Marquardt algorithm is so popular for non-linear least squares.

However, I absolutely agree that HMC is better than nothing (by "nothing" I mean vanilla Metropolis, MALA, and the Haario extensions).
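As a concrete illustration of the long-valley point (a self-contained toy example, nothing to do with the Pints API): on the Rosenbrock function, whose minimum sits at the end of a long curved valley, fixed-step steepest descent crawls while Levenberg-Marquardt converges in a handful of residual evaluations.

```python
import numpy as np
from scipy.optimize import least_squares

# Rosenbrock as a least-squares problem: r = (10*(y - x^2), 1 - x),
# so that sum(r^2) is the classic banana-shaped valley.
def residuals(p):
    x, y = p
    return np.array([10.0 * (y - x**2), 1.0 - x])

def f(p):
    return np.sum(residuals(p)**2)

def grad(p, h=1e-6):
    # central-difference gradient: "gradient-only" information
    return np.array([(f(p + e) - f(p - e)) / (2 * h)
                     for e in h * np.eye(len(p))])

p = np.array([-1.2, 1.0])
for _ in range(5000):
    p = p - 1e-4 * grad(p)                 # steepest descent, tiny steps
print("steepest descent, 5000 steps:", p)  # typically still short of (1, 1)

lm = least_squares(residuals, np.array([-1.2, 1.0]), method="lm")
print("Levenberg-Marquardt:", lm.x, "nfev =", lm.nfev)  # converges to (1, 1)
```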

ben18785 commented 6 years ago

Hi Sanmitra, I need to have a read then. If ODE models are more complex, my feeling is that it's due to their inherent identifiability issues.

I still think HMC is worth pursuing -- I think we're agreed on that! I am enjoying this debate! Let's have it in person on Tuesday.

Best,

Ben


sanmitraghosh commented 6 years ago

Yes, give HMC a try and see. The real problem is the non-linear relation between the parameters and the state space. To be honest, this is more a geometric problem than a statistical one, which is why I have found common statistical intuitions of little use on these problems. There is a lot of research material in dynamical systems, control theory, non-linear non-Gaussian filtering, non-linear optimisation, information geometry, etc., and basing our intuitions on these is more useful than stuff from Gelman & co. I think they have designed brilliant algorithms, but to solve a completely different set of problems. Mostly exponential conjugate ones :)

sanmitraghosh commented 6 years ago

I forgot to mention this: I actually used HMC with the Sundials ODE suite (with sensitivities) during my thesis. I have the code written in MATLAB; it's somewhere on my home computer. Let me know if that would be helpful, and I'll try to find it and share it with you. You would then just need to translate it accordingly.

ben18785 commented 6 years ago

The paper is interesting, although I do think it's a rather grandiose way of essentially saying 'models with many parameters are often under-identified'. Haha. I do think that the propensity for under-identification in ODE models is probably more extensive than I had anticipated. As you said, this is likely manifested as long valleys between peaks (so we have multi-modality).

Re: your code -- that might be very useful mate, thanks for the offer! Let's discuss it next week.

What's your feeling on which methods will work best for ODE problems?


ben18785 commented 6 years ago

I like the look of this paper (highly cited in physics). It's a type of nested sampling. @sanmitraghosh, have a look at the likelihood they show on page 5. It looks pretty nasty! I might try to get this up and running, along with simpler nested sampling...

sanmitraghosh commented 6 years ago

See, the only two problem classes where the first moment of the generative distribution (the likelihood) is a complex non-linear transform of the parameters we want to infer are neural networks and ODEs. So we have:

the likelihood N(mu, sigma), where mu = f(f(f(f(x)))) is usually such a complex non-linearity that we can't even describe it analytically. For neural networks we have consequently seen all the best statistical intuition fall apart completely, and I guess the same is true when we introduce complex ODEs such as the ones from gene circuits and whatnot.

Thus I believe any standard MCMC technique augmented with CMA-ES will give us a working answer. BUT that is akin to interpreting the entire culture of the USA by walking around only the Bay Area.

sanmitraghosh commented 6 years ago

Well, my dearest friend did two years of post-doc under Prof. Hobson's supervision working on exactly these nested sampling methods, for geospatial (PDE inverse) problems. I know from the horse's mouth that they have the same problems as we see in ODEs. In fact, the guy working on this method in that lab used to ring me up for long discussions about how to get around these issues. To be honest, I don't know, and I am pretty sure nobody knows. But if you want to use nested sampling, then you can use this package, which is more appropriate for our problems: https://www.ncbi.nlm.nih.gov/pubmed/25399028.

sanmitraghosh commented 6 years ago

In an ideal world I would use an SMC/particle filter with lots of particles (10k) and a slow adaptive annealing scheme, and I would move each particle using an HMC sampler. However, no existing algorithm does these steps yet, so we would have to build it. I would (almost) believe the inference results if we could carry out such a procedure. Otherwise, I believe it's better to use quick-and-dirty variational inference for this sort of problem. It won't give us the correct answer, but a usable one would go a long way for me.
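For concreteness, a structural sketch of that scheme (all helper functions here are hypothetical, not an existing Pints API): anneal from the prior to the posterior with incremental importance weights, resample when the effective sample size collapses, and move every particle with an HMC kernel targeting the current tempered posterior.

```python
import numpy as np

def ess(logw):
    # effective sample size of the normalised importance weights
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return 1.0 / np.sum(w**2)

def tempered_smc(sample_prior, log_lik, hmc_move, n=10_000, n_temps=100,
                 rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x = sample_prior(n)                        # (n, d) particles from prior
    logw = np.zeros(n)
    betas = np.linspace(0.0, 1.0, n_temps)     # slow annealing schedule
    for b0, b1 in zip(betas[:-1], betas[1:]):
        ll = np.array([log_lik(xi) for xi in x])
        logw += (b1 - b0) * ll                 # incremental importance weights
        if ess(logw) < 0.5 * n:                # resample when degenerate
            w = np.exp(logw - logw.max())
            w /= w.sum()
            x = x[rng.choice(n, size=n, p=w)]
            logw = np.zeros(n)
        # mutate: HMC kernel invariant for prior(x) * lik(x)**b1
        x = np.array([hmc_move(xi, b1) for xi in x])
    return x, logw
```

In practice the temperature schedule would itself be adaptive (chosen so each drop in ESS is controlled), which is the "slow adaptive annealing" part; a fixed linear schedule keeps the sketch short.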

rccreswell commented 4 years ago

While working on inference for noise process parameters, I happened to come across the same paper @ben18785 mentioned when this issue was opened. Has anyone had any further thoughts on whether this method makes sense for Pints?

MichaelClerx commented 4 years ago

Not sure! If it really involves creating an emulator, then I'd want to know how generalisable the method is. We've had several people build emulators for specific problems and spend years on them, so I'm wondering why this particular one would be easy and generalisable.