pints-team / pints

Probabilistic Inference on Noisy Time Series
http://pints.readthedocs.io

Work out interface to get 1st order sensitivities into Pints #43

Closed MichaelClerx closed 6 years ago

MichaelClerx commented 6 years ago

Gary wrote: "Another 'whilst I remember' type thing! It would be good to get boring-old-Fisher Information / Hessian at the MLE and the covariance matrix that that implies, so we could compare max likelihood with Bayesian for some of these problems. Some of our peaks are so unimodal I suspect it may be an excellent approximation for a lot of our problems, and a zillion times faster."
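(For reference, a minimal sketch of the approximation Gary describes, assuming the Hessian of the log-likelihood at the MLE is available; the function name is illustrative, not Pints API.)

import numpy as np


def laplace_covariance(hessian_at_mle):
    # Covariance approximation: inverse of the observed Fisher information,
    # i.e. the negative Hessian of the log-likelihood at the MLE.
    return np.linalg.inv(-np.asarray(hessian_at_mle))


# Example: a 2-parameter log-likelihood with Hessian -diag(4, 1) at the MLE
# gives approximate standard deviations of 0.5 and 1.
print(np.sqrt(np.diag(laplace_covariance([[-4.0, 0.0], [0.0, -1.0]]))))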

mirams commented 6 years ago

Yeah, I suspect things like CVODES (note the S) can get this for you fairly simply...

MichaelClerx commented 6 years ago

Hmmm. I guess that would make it outside the scope of Pints though, apart from the bit where we provide an interface to get the information in! Sanmitra was talking about something similar though, so I'll have a look at doing this for cell models anyway...

sanmitraghosh commented 6 years ago

Take a look at this solver package; they have got the sensitivity stuff done: http://www.jmodelica.org/assimulo_home/_modules/assimulo/examples/cvode_with_initial_sensitivity.html

MichaelClerx commented 6 years ago

I'll try to get this in Myokit, and in Jonathan's new solver back-end once I know how that works. But neither are Pints issues so closing this ticket!

MichaelClerx commented 6 years ago

By the way, if anyone needs it in a hurry, Myokit already has a (slow, not cvode-based) method to do this: http://docs.myokit.org/api_simulations/PSimulation.html

And Chaste can do it via the automatic Jacobian generator, right?

mirams commented 6 years ago

Chaste's Jacobian is for the RHS (dy/dt = f(t,y) = [f_1, f_2, f_3, ...]) to speed up ODE solving, so it looks like [df_1/dy_1, df_1/dy_2, etc.], not really anything to do with the derivatives of outputs (or 'auxiliary functions' like current) with respect to the parameters p, i.e. [dy_1/dp_1, dy_1/dp_2, etc.].

Depends on what Pints is supposed to be doing for us, but I would have thought this was in scope if it is supposed to be evaluating likelihoods and doing things with them (the Fisher information should be the second derivative of the negative log-likelihood w.r.t. the parameters).

MichaelClerx commented 6 years ago

Oh ok. Well, if you don't mind forward Euler you can get dy/dp (or dI/dp) in Myokit for the time being :-)

Pints is not an ODE solver (and I'm pretty sure it shouldn't be). It expects users to provide external models that have a method simulate(self, parameters, times) that returns a series of y(t) evaluations. These are then passed to Pints via the following interface, which your forward model should implement: https://github.com/martinjrobins/pints/blob/master/pints/_core.py#L14-L50

We don't want to write yet another ODE/PDE/Everything solver

So what we can do is add an interface for a model that can also provide dy/dp|t evaluations, but it shouldn't be Pints' responsibility to calculate them.
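For illustration, here is a minimal sketch of a user-provided model exposing that simulate(parameters, times) interface; the logistic ODE and the SciPy solver are just placeholders, and in Pints you would subclass the ForwardModel base class linked above (method names may differ between versions).

import numpy as np
from scipy.integrate import odeint


class LogisticModel:
    # Toy forward model: dy/dt = r * y * (1 - y / k), with parameters p = [r, k].

    def n_parameters(self):
        return 2

    def simulate(self, parameters, times):
        r, k = parameters

        def rhs(y, t):
            return r * y * (1.0 - y / k)

        return odeint(rhs, 0.1, times).reshape(-1)


# Example usage
times = np.linspace(0, 10, 101)
values = LogisticModel().simulate([0.5, 2.0], times)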

mirams commented 6 years ago

I think I want Pints to go from dy/dp (and the Hessian) to the likelihood and Fisher information, and to provide some kick to whatever solvers are underneath to say which parameters etc. we are interested in at the moment...

MichaelClerx commented 6 years ago

As long as we can come up with a sensible interface for it. At the moment, Pints has no notion of a solver and I think that's a good thing!

martinjrobins commented 6 years ago

At the moment we have a separate class for models with sensitivities.

MichaelClerx commented 6 years ago

See also #343 and #344

An alternative is to add a grad1() method and a grad2() method to LogPdf etc., and have them raise a NotImplementedError by default. The grad1() method could then return either dL/dp or (L, dL/dp).

Not sure which is best!

(One thing we'd lose is that you can use the LogLikelihood classes as functions)
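A minimal sketch of that alternative (names and the toy subclass are hypothetical, not an agreed design):

import numpy as np


class LogPdf:
    def __call__(self, p):
        raise NotImplementedError

    def grad1(self, p):
        # Return (L(p), dL/dp); raises by default so methods can detect support.
        raise NotImplementedError

    def grad2(self, p):
        # Return (L(p), dL/dp, d2L/dp2); raises by default.
        raise NotImplementedError


class GaussianLogPdf(LogPdf):
    # Toy example: standard normal log-pdf with an analytic first derivative.

    def __call__(self, p):
        return float(-0.5 * np.sum(np.asarray(p) ** 2))

    def grad1(self, p):
        p = np.asarray(p, dtype=float)
        return self(p), -p

Since __call__ is untouched, such objects could still be used as plain functions.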

MichaelClerx commented 6 years ago

Just chatting to @sanmitraghosh about how good 1st-order methods will need to be if we want to see an overall speed-up.

Using the IKr model as an example, we have ns = 2 states and np = 8 parameters (ignoring g and E).

This means that adding sensitivities implies:

  1. The ODE turns from a 2-state system into a 2 + 2*8 = 18-state one (because all the sensitivities need to be integrated as well).
  2. The RHS evaluation goes from approx 2*x operations to approx 2*x + 2*x*8 (for auto-diff this is exact, although the speeds vary; with analytical expressions you get some simplifications and some complications, so I'm assuming it balances out).

If the solver efficiency is determined by the number of states to integrate, this means both the solver code and the RHS evaluation code will slow down by a factor of approximately (ns + ns*np) / ns = 1 + np.

That means the method will have to be (1 + np) (=9 for IKr) times more efficient if we want to see any benefit.

(Note: We're really after d output / d params, not d state / d params, but I'm assuming the output depends on every state)
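To make the state-count argument concrete, here is a minimal sketch of the forward-sensitivity augmentation on a toy 2-state, 2-parameter linear model (not the IKr model): the sensitivity matrix S = dy/dp obeys dS/dt = (df/dy) S + df/dp, so the augmented system has ns * (1 + np) states, which for IKr (ns = 2, np = 8) gives the 18 states above.

import numpy as np
from scipy.integrate import odeint

ns, npar = 2, 2  # states and parameters in this toy example


def rhs_augmented(z, t, p):
    a, b = p
    y = z[:ns]
    s = z[ns:].reshape(ns, npar)                       # sensitivities dy/dp
    f = np.array([-a * y[0], a * y[0] - b * y[1]])     # dy/dt
    dfdy = np.array([[-a, 0.0], [a, -b]])
    dfdp = np.array([[-y[0], 0.0], [y[0], -y[1]]])
    ds = dfdy @ s + dfdp                               # dS/dt
    return np.concatenate([f, ds.reshape(-1)])


z0 = np.concatenate([[1.0, 0.0], np.zeros(ns * npar)])  # y(0), and S(0) = 0
times = np.linspace(0, 5, 51)
z = odeint(rhs_augmented, z0, times, args=([0.5, 0.3],))
y = z[:, :ns]                                # states at each time
dydp = z[:, ns:].reshape(-1, ns, npar)       # sensitivities at each time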

MichaelClerx commented 6 years ago

Any thoughts @ben18785 and @mirams ? Am I missing something?

mirams commented 6 years ago

A lot of methods will use finite differences (or similar) to do something like gradient descent if they don't have sensitivities, so those will typically need to run the model at the current parameters, and then plus a bit and minus a bit on each one, i.e. 1 + 2*np = 17 evaluations of the ordinary ODE system to get a basic approximation of the gradient. So if you can get the exact gradient in 9 times the effort it is probably a win!
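For comparison, a rough sketch of that finite-difference approach (a generic central-difference gradient; names are illustrative, not Pints code): each gradient costs one baseline solve plus two extra solves per parameter.

import numpy as np


def central_difference_gradient(f, p, h=1e-6):
    # Approximates df/dp; calls f twice per parameter (2 * len(p) evaluations).
    p = np.asarray(p, dtype=float)
    grad = np.empty_like(p)
    for i in range(p.size):
        dp = np.zeros_like(p)
        dp[i] = h
        grad[i] = (f(p + dp) - f(p - dp)) / (2 * h)
    return grad


# Example: gradient of a simple quadratic in 8 parameters (approx. [2, 2, ..., 2])
print(central_difference_gradient(lambda p: float(np.sum(p ** 2)), np.ones(8)))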

MichaelClerx commented 6 years ago

That's a good point! Not sure how it works out for sampling (MCMC methods do 1 evaluation per iteration, but I suppose you'd have to look at the effective sample size?). For optimisation there are definitely methods that re-use information (e.g. Nelder-Mead and other simplex methods that move 1 vertex at a time). The particle ones currently use 4 + floor(3 * ln(np)) as a heuristic for the number of evaluations per iteration, which works out as 10 per iteration for IKr.

mirams commented 6 years ago

(same kind of thing for CVODE having an Analytic Jacobian versus multiple calls to the RHS to approximate it)

MichaelClerx commented 6 years ago

Here are two proposals:

1. Separate classes

The methods work with a LogPdf that they optimise or sample from:

LogPdf
    __call__(p) --> L(p)
LogPdfS1
    __call__(p) --> L(p), L'(p)
LogPdfS2
    __call__(p) --> L(p), L'(p), L''(p)

The main example is:

KnownNoiseLogLikelihood
    __call__(p) --> L(p)
KnownNoiseLogLikelihoodS1
    __call__(p) --> L(p), L'(p)
KnownNoiseLogLikelihoodS2
    __call__(p) --> L(p), L'(p), L''(p)

Note at this point there's potential for a bit of code duplication, as the S1 version does everything the bare version does, and the S2 version does everything the S1 version does.

The situation is very similar for an error measure E(p).

All our noise-model based LogLikelihoods extend the ProblemLogLikelihood class (but user-defined ones and toy ones don't have to).

ProblemLogLikelihood
    __call__(p) --> L(p), calls Problem.__call__(p)
ProblemLogLikelihoodS1
    __call__(p) --> L(p), L'(p), calls ProblemS1.__call__(p)
ProblemLogLikelihoodS2
    __call__(p) --> L(p), L'(p), L''(p), calls ProblemS2.__call__(p)

Our problem classes would then be defined as:

(Single/MultiOutput)Problem
    evaluate(p) --> y(p), calls ForwardModel.simulate(p, times)
(Single/MultiOutput)ProblemS1
    evaluate(p) --> y(p), y'(p), calls ForwardModelS1.simulate(p, times)
(Single/MultiOutput)ProblemS2
    evaluate(p) --> y(p), y'(p), y''(p), calls ForwardModelS2.simulate(p, times)

Note that y(p) here is typically the solution to an ODE y(p) = integrate(f(p), y0).

With associated model classes:

ForwardModel
    simulate(p, times) --> y(p)
ForwardModelS1
    simulate(p, times) --> y(p), y'(p)
ForwardModelS2
    simulate(p, times) --> y(p), y'(p), y''(p)

For ODEs, this model formulation makes a lot of sense as y'(p) will often be calculated in tandem with y(p) (e.g. using CVODE to integrate f(p) and f'(p)). This is true irrespective of how f'(p) was obtained (autodiff or analytical).

In this scenario, we can check at construction time that everything is going to work, and users only need to implement a single simulate method. However, there is scope for some duplicate code.

Situation 1: Method wants jacobian, model doesn't provide it. A method that wants a jacobian can check the type of LogPdf it was given (so incompatible models are detected at construction time).

Situation 2: Model has f'(x) and/or f''(x), method doesn't need it. The model can't be used! The user will have to make a new, simpler wrapper. This isn't as bad as it sounds: you don't want this overhead anyway if it isn't needed. You can also see this as the method warning the user of unneeded overhead!

Least-effort?: The situation for methods without sensitivities doesn't change at all, which is good. The interface to users is simple and there is no hidden stuff (e.g. caching) making Pints harder to understand.

In a slight variation, we could do this:

ForwardModel
    simulate(p, times) --> y(p)
ForwardModelS1
    simulate_s1(p, times) --> y(p), y'(p)
ForwardModelS2
    simulate_s2(p, times) --> y(p), y'(p), y''(p)

This would allow the creation of model wrappers implementing both ForwardModel (which defines simulate) and ForwardModelS1 (which defines simulate_s1), so that the appropriate simulation method can be used when needed.
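A minimal sketch of this variation (the base classes and toy model are hypothetical, just to show the shape of the interface):

import numpy as np


class ForwardModel:
    def n_parameters(self):
        raise NotImplementedError

    def simulate(self, parameters, times):
        # Return y(t) at the given times.
        raise NotImplementedError


class ForwardModelS1(ForwardModel):
    def simulate_s1(self, parameters, times):
        # Return (y(t), dy/dp(t)) at the given times.
        raise NotImplementedError


class ExponentialDecayModel(ForwardModelS1):
    # Toy model y(t) = y0 * exp(-k * t), with parameters p = [y0, k].

    def n_parameters(self):
        return 2

    def simulate(self, parameters, times):
        y0, k = parameters
        return y0 * np.exp(-k * np.asarray(times))

    def simulate_s1(self, parameters, times):
        y0, k = parameters
        t = np.asarray(times)
        y = y0 * np.exp(-k * t)
        dydp = np.stack([np.exp(-k * t), -y0 * t * np.exp(-k * t)], axis=1)
        return y, dydp

A method that needs sensitivities calls simulate_s1(); anything else keeps using simulate(), so derivative-free methods pay no extra cost.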

2. Classes with separate methods

The methods work with a LogPdf that they optimise or sample from:

LogPdf
    __call__(p) --> L(p)
    jacobian(p) --> L'(p)
    hessian(p) --> L''(p)

The main example is:

KnownNoiseLogLikelihood
    __call__(p) --> L(p)
    jacobian(p) --> L'(p)
    hessian(p) --> L''(p)

Note that the calculation of L(p), L'(p), and L''(p) are fairly independent, as long as the simulated results are stored. For the problem class, we get

ProblemLogLikelihood
    __call__(p) --> L(p), calls Problem.__call__(p) OR uses cached result
    jacobian(p) --> L'(p), calls Problem.jacobian(p) OR uses cached result
    hessian(p) --> L''(p), calls Problem.hessian(p) OR uses cached result

At this point, we could do some caching to ensure that no double simulations are run, although we could also leave that to the problem class:

(Single/MultiOutput)Problem
    evaluate(p) --> y(p), calls ForwardModel.values OR uses cached result
    jacobian(p) --> y'(p), calls ForwardModel.jacobian OR uses cached result
    hessian(p) --> y''(p), calls ForwardModel.hessian OR uses cached result

Again, we could leave the caching for a lower level:

ForwardModel
    simulate(p, times) --> y(p), y'(p), y''(p) :: DEFINED BY USER
    values(p, times) --> calls simulate or uses cached result :: DEFINED BY PINTS
    jacobian(p, times) --> calls simulate or uses cached result :: DEFINED BY PINTS
    hessian(p, times) --> calls simulate or uses cached result :: DEFINED BY PINTS

In this version, all models implement the simulate() method to return three parts (with two being None for models that don't support jacobians).

Alternatively, we could still have a split at the model level

ForwardModel
    simulate(p, times) --> y(p) :: DEFINED BY USER
    values(p, times) --> calls simulate or uses cached result

ForwardModelS1
    simulate(p, times) --> y(p), y'(p), y''(p) :: DEFINED BY USER
    values(p, times) --> calls simulate or uses cached result
    jacobian(p, times) --> calls simulate or uses cached result

ForwardModelS2
    simulate(p, times) --> y(p), y'(p), y''(p) :: DEFINED BY USER
    values(p, times) --> calls simulate or uses cached result
    jacobian(p, times) --> calls simulate or uses cached result
    hessian(p, times) --> calls simulate or uses cached result

This puts the caching at the lowest level, which avoids a lot of trouble. It does mean that people who implement something at one of the higher levels (bypassing the Problem/Model classes, like we do for the toy problems) don't benefit from it.
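A minimal sketch of that lowest-level caching (hypothetical names; the cache key and invalidation strategy would need more thought in practice):

class ForwardModelS1:
    def __init__(self):
        self._key = None
        self._result = None

    def simulate(self, parameters, times):
        # User-defined: return (y(t), dy/dp(t)), both computed in one solve.
        raise NotImplementedError

    def _cached(self, parameters, times):
        key = (tuple(parameters), tuple(times))
        if key != self._key:
            self._key = key
            self._result = self.simulate(parameters, times)
        return self._result

    def values(self, parameters, times):
        return self._cached(parameters, times)[0]

    def jacobian(self, parameters, times):
        return self._cached(parameters, times)[1]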

Situation 1: Method wants jacobian, model doesn't provide it. The call to jacobian would trickle down to the model level, where a TypeError would be raised because the method doesn't exist.

Situation 2: Model has f'(x) and/or f''(x), method doesn't need it. In this case, the jacobian (and hessian) will be calculated anyway, as the model has no way of knowing that these methods aren't wanted. (If we create some way, we lose the advantages of having a simple user interface and caching in the Pints classes). Again, not a huge deal though, as it would make more sense for people to write a model appropriate to what they want to do (i.e. not calculate unneeded stuff).

Least-effort?: A lot of Pints code stays the same, except we add on methods everywhere. It is trickier for the user though, as they need to know about the caching to understand what's going on. The model interface also gets more complicated, even if you don't use sensitivities.

Conclusion?

  1. Method 1 feels simpler to me (even though there are more classes, and Pints gets a bit bigger inside). It's also a bit cleaner in situation 2.
  2. My guess is that most users won't provide sensitivities; this is also an argument in favour of method 1, as the derivative-free stuff stays as fast and simple as it is.

Curious to hear your thoughts. Sorry for the massive post @martinjrobins @ben18785 @sanmitraghosh @chonlei @mirams

martinjrobins commented 6 years ago

I have two options as well, both inspired by Michael's option 2.

Option 1: With Caching

Making the split at the model level, and using add-on methods for higher-level classes, seems like a good (i.e. simple) approach. I'm not keen on complicating the model interface for non-sensitivity models; could you have something like this, so that ForwardModel remains as simple as possible?

ForwardModel
    values(p, times) --> y(p) :: DEFINED BY USER

ForwardModelS1
    simulate(p, times) --> y(p), y'(p) :: DEFINED BY USER
    values(p, times) --> calls simulate or uses cached result
    jacobian(p, times) --> calls simulate or uses cached result

ForwardModelS2
    simulate(p, times) --> y(p), y'(p), y''(p) :: DEFINED BY USER
    values(p, times) --> calls simulate or uses cached result
    jacobian(p, times) --> calls simulate or uses cached result
    hessian(p, times) --> calls simulate or uses cached result

Yes, it does look a bit odd that users define values for non-sensitivity models and simulate for sensitivity models, but if we agree that most users won't need sensitivities, then for the vast majority of cases they will only implement ForwardModel, and won't have to worry about the added complexity of ForwardModelS1 and ForwardModelS2.

Option 2: With results object

The caching sounds complicated and error-prone. Perhaps we can have the models return some sort of object that has values, jacobian and hessian functions defined (or not defined).

ForwardModel
    simulate(p, times) --> ForwardModelResult(p, times) 
ForwardModelResult
    values() --> y(p)

ForwardModelS1
    simulate(p, times) --> ForwardModelS1Result(p, times) 
ForwardModelS1Result
    values() --> y(p)
    jacobian() --> y'(p) 

ForwardModelS2
    simulate(p, times) --> ForwardModelS2Result(p, times)
ForwardModelS2Result
    values() --> y(p) 
    jacobian() --> y'(p)
    hessian() --> y''(p) 

So effectively you still have the same "caching" behaviour, since the results objects would probably calculate everything in the constructor, but this would be explicit and not hidden

The downside of this is added complexity in the design of the forward models, but perhaps you could just ask the user to implement the constructor of the results object, then use some fancy Python decorator magic to generate the above design.
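A minimal sketch of the results-object idea (again with hypothetical names and a toy model):

import numpy as np


class ForwardModelS1Result:
    def __init__(self, values, jacobian):
        self._values = values
        self._jacobian = jacobian

    def values(self):
        return self._values

    def jacobian(self):
        return self._jacobian


class ExponentialDecayModelS1:
    # Toy model y(t) = y0 * exp(-k * t); simulate() does all the work up front.

    def simulate(self, parameters, times):
        y0, k = parameters
        t = np.asarray(times)
        y = y0 * np.exp(-k * t)
        dydp = np.stack([np.exp(-k * t), -y0 * t * np.exp(-k * t)], axis=1)
        return ForwardModelS1Result(y, dydp)


# Example usage: everything is computed once, then read off the result object
result = ExponentialDecayModelS1().simulate([1.0, 0.5], np.linspace(0, 10, 11))
y, dydp = result.values(), result.jacobian()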

MichaelClerx commented 6 years ago

Option: n_derivatives() method

ForwardModel()
    n_derivatives() --> 0, 1, or 2
    simulate() --> y, (y, y'), or (y, y', y'')

XOutputProblem()
    n_derivatives() --> model.n_derivatives()
    evaluate() --> y, (y, y'), or (y, y', y'')

Then, we'd check n_derivatives in the ErrorMeasure and ProblemLogLikelihood classes. At this level we can pass on n_derivatives() from the problem, OR the highest level that we've implemented for that particular measure/log-likelihood. I think it's OK to do this here because the error measures / log-likelihoods already need to know about sensitivities.

@martinjrobins suggests that with this method we could add an optional argument to Model.simulate(derivatives=0) that indicates the level of derivatives we want. Most models can ignore this argument, and the Problem classes can check whether the requested level of derivatives is available...

MichaelClerx commented 6 years ago

Option: n_derivatives argument

Based on that, I propose


ForwardModel
    simulate(p, times, derivatives=0) --> y
    simulate(p, times, derivatives=1) --> (y, y')
    simulate(p, times, derivatives=2) --> (y, y', y'')
    n_derivatives() --> The maximum number of derivatives available (default=0)

XOutputProblem
    evaluate(p, derivatives=0) --> y
    evaluate(p, derivatives=1) --> (y, y') or an exception, if Model.n_derivatives() < 1
    evaluate(p, derivatives=2) --> (y, y', y'') or an exception, if Model.n_derivatives() < 2
    n_derivatives() --> The maximum number of derivatives available (passed on from Model)

This lets methods decide which version of a model they want (if models support this).

martinjrobins commented 6 years ago

Yup, I think the n_derivatives argument is the way to go. Would higher-level classes (e.g. log-likelihoods) have a derivatives argument as well (which they just pass down)?

MichaelClerx commented 6 years ago

@martinjrobins yup!

However, that method makes the documentation for ForwardModel much harder to read, so the ForwardModel interface (the users' main entry point!) becomes harder to understand.

We can probably make life simpler by having:

ForwardModel
    simulate(p, times) --> y
ForwardModelS1
    simulate(p, times, derivatives=0) --> y
    simulate(p, times, derivatives=1) --> (y, y')
ForwardModelS2
    simulate(p, times, derivatives=0) --> y
    simulate(p, times, derivatives=1) --> (y, y')
    simulate(p, times, derivatives=2) --> (y, y', y'')

XOutputProblem
    evaluate(p, derivatives=0) --> y
    evaluate(p, derivatives=1) --> (y, y') or an exception if the model doesn't support it
    evaluate(p, derivatives=2) --> (y, y', y'') or an exception if the model doesn't support it
    n_derivatives() --> The maximum number of derivatives available (passed on from Model)

LogLikelihoods etc. still all use an n_derivatives() method to indicate their level of support
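A minimal sketch of this last version (hypothetical code, just to illustrate the flow of the derivatives argument and the n_derivatives() check):

class ForwardModel:
    def simulate(self, parameters, times):
        raise NotImplementedError

    def n_derivatives(self):
        return 0


class ForwardModelS1(ForwardModel):
    def simulate(self, parameters, times, derivatives=0):
        # Return y for derivatives=0, or (y, y') for derivatives=1.
        raise NotImplementedError

    def n_derivatives(self):
        return 1


class SingleOutputProblem:
    def __init__(self, model, times, values):
        self._model, self._times, self._values = model, times, values

    def n_derivatives(self):
        return self._model.n_derivatives()

    def evaluate(self, parameters, derivatives=0):
        if derivatives == 0:
            return self._model.simulate(parameters, self._times)
        if derivatives > self._model.n_derivatives():
            raise NotImplementedError(
                'Model does not provide order-%d derivatives' % derivatives)
        return self._model.simulate(
            parameters, self._times, derivatives=derivatives)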

MichaelClerx commented 6 years ago

Closed with #358