mrc-ide / odin.dust

Compile odin to dust
https://mrc-ide.github.io/odin.dust

Data comparison mini DSL #130

Open richfitz opened 1 year ago

richfitz commented 1 year ago

Two proposals for a data modelling/probabilistic language that we need in odin/odin.dust (initially this is only something in odin.dust)

Considerations:

The actual bits of differentiation and generating compiled code are not very hard once we have the interface done, and dust already supports almost everything that we need here.

We already have a few densities implemented in dust, and we know that this is enough to do the core bits required by sircovid once we deal with the generating-process bits. This is easy enough to expand, but there is a question of whether we want to make it user-expandable (for now I think that we won't).

Here are two starting points with hypothetical implementations of the interface, for discussion. Neither works at present, and both require changes to be made in odin (as well as the implementation in odin.dust).

In blocks, like stan:

This is an excerpt of the support for the sircovid "basic" model showing the compare "block" and all the bits required to support it:

phi_ICU <- user()
kappa_ICU <- user()
phi_death <- user()
kappa_death <- user()

update(D_inc) <- ...
update(I_ICU_tot) <- ...

config(data) <- {
  deaths ~ real
  icu ~ real
}

config(compare) <- {
  deaths ~ negative_binomial(kappa_death, phi_death * D_inc)
  icu ~ negative_binomial(kappa_ICU, phi_ICU * I_ICU_tot)
}
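To make the semantics of the block concrete, here is a minimal Python sketch of the log-likelihood that a compiled compare function for the block above might evaluate. Everything here is hypothetical (the dict-based interface, the function names): dust's real compare is compiled C++, and this only illustrates the arithmetic.

```python
# Illustrative sketch only; names and interface are hypothetical.
from scipy.stats import nbinom


def dnbinom(x, size, mu):
    # R-style (size, mean) parameterisation mapped to scipy's (n, p)
    return nbinom.logpmf(x, size, size / (size + mu))


def compare(data, state, pars):
    # One log-density term per `~` statement in the block, summed
    ll = dnbinom(data["deaths"], pars["kappa_death"],
                 pars["phi_death"] * state["D_inc"])
    ll += dnbinom(data["icu"], pars["kappa_ICU"],
                  pars["phi_ICU"] * state["I_ICU_tot"])
    return ll
```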

It's possible that we could move the actual compare bit into its own file:

config(compare) <- "compare.R"

with compare.R containing:

deaths ~ negative_binomial(kappa_death, phi_death * D_inc)
icu ~ negative_binomial(kappa_ICU, phi_ICU * I_ICU_tot)

This is slightly easier to implement but leads to things being scattered across several files (we might take the same approach for the data too).

I don't love the data block here but struggle to come up with something nicer. We could just infer all lhs of ~ expressions as data and add something to declare additional ones. There's no strong reason we need any declaration that things are real or integer, as in practice everything is real, so we just need a list. We could also support something like:

config(data) <- c("deaths", "icu")

(with or without quotes). Or we could do something like

config(compare) <- function(deaths, icu) {
  deaths ~ negative_binomial(kappa_death, phi_death * D_inc)
  icu ~ negative_binomial(kappa_ICU, phi_ICU * I_ICU_tot)
}

which captures the idea of the data inputs (arguments to the function) but is pretty rude about what a function is. Finally, we could merge the two blocks like

config(compare) <- {
  deaths ~ real
  icu ~ real
  deaths ~ negative_binomial(kappa_death, phi_death * D_inc)
  icu ~ negative_binomial(kappa_ICU, phi_ICU * I_ICU_tot)
}

Inline, like odin:

A different approach might look like new odin functions like this:

phi_ICU <- user()
kappa_ICU <- user()
phi_death <- user()
kappa_death <- user()

update(D_inc) <- ...
update(I_ICU_tot) <- ...

deaths <- data()
icu <- data()

compare(deaths) ~ negative_binomial(kappa_death, phi_death * D_inc)
compare(icu) ~ negative_binomial(kappa_ICU, phi_ICU * I_ICU_tot)

(here, we might want to rename compare to something else). This requires a little more surgery in odin, but nothing too bad really. I think that this works quite well for the data inputs, and it provides a natural way of declaring integer inputs if they become needed (deaths <- data(integer = TRUE), perhaps).

Data in the comparison function

One example we found with data on both the lhs and rhs is something like

observed ~ binomial(n_tests, prob)

where observed is some observed count data, n_tests is some observed number of tests (so both of these come from the data) and prob is some model quantity.
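A hypothetical sketch of that likelihood term, with both the observed count and the number of tests drawn from the data row and only the probability coming from the model state (all names illustrative, not the real interface):

```python
# Sketch: data appears on both sides of `observed ~ binomial(n_tests, prob)`
from scipy.stats import binom


def compare_term(data, state):
    # observed and n_tests come from the data; prob from the model
    return binom.logpmf(data["observed"], data["n_tests"], state["prob"])
```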

edknock commented 1 year ago

I like the inline approach; it feels more in keeping with current odin?

MJomaba commented 1 year ago

It would be nice to allow

config(compare) <- function(deaths, icu) {
  deaths ~ negative_binomial(kappa_death, phi_death * D_inc)
  icu ~ negative_binomial(kappa_ICU, phi_ICU * I_ICU_tot)
}

to be written also like this:

config(compare) <- function(deaths, icu) {
  p_deaths <- phi_death * D_inc
  deaths ~ negative_binomial(kappa_death, p_deaths)
  p_icu <- phi_ICU * I_ICU_tot
  icu ~ negative_binomial(kappa_ICU, p_icu)
}
annecori commented 1 year ago

I have much less experience with the fitting code and haven't followed all the discussions around this, so this should be taken with a pinch of salt.

However (before seeing the previous comments on this thread), my gut feeling would have been that the block approach is best (and I'm not a Stan user, so I'm not really influenced by that). The reason is that I think we may want to keep the "model" completely separate from the data and fitting, so you could fit the same model to different data, in different ways, with a model file that stays the same throughout. Think, for example, of the work Pablo is currently doing, trying to fit the same model to multiple datasets which don't all have the same structure. But maybe I'm missing something?

I think that this is not nice:

config(compare) <- {
  deaths ~ real
  icu ~ real
  deaths ~ negative_binomial(kappa_death, phi_death * D_inc)
  icu ~ negative_binomial(kappa_ICU, phi_ICU * I_ICU_tot)
}

So I'd avoid that.

I agree with @MJomaba that LHS should allow data but also parameters, to allow inference of "missing data" (and hierarchical models / hyperpriors I think).

Also agree with @MJomaba re the "recalculation" of parameters, that would be nice if possible.

Finally, I like y ~ poisson(lambda) and don't think we should allow two distinct statements when one is enough! So I disagree with @MJomaba on this one :-)

cwhittaker1000 commented 1 year ago

Really exciting stuff @richfitz !

Agreed re the LHS needing to be able to handle parameters (I think the hierarchical case @annecori mentions seems particularly pertinent).

I think on balance I prefer the blocks version. The inline version would have been my (weakly held) preference but I think Anne's point about wanting to potentially fit the same model to different data (or with different likelihoods) seems very relevant.

In that context, I think ideally you'd want to be able to use the config(compare) <- "compare.R" approach (and just sub in a different "compare.R" file each time), and I'm struggling to think of a way that file approach could be used if you went down the inline route. I think the block approach opens up the possibility of the file-based approach, and outside of that, it seems pretty comparable to the inline approach.

(I am a Stan user, so that take might be biased)

edknock commented 1 year ago
  1. I agree with @annecori that y ~ poisson(lambda) is better, as it will cover both rpois and dpois (and potentially others if eventually needed, e.g. ppois or qpois)

  2. Maybe I've missed something, but I don't understand why you couldn't fit the same model to different data with the inline approach. With the current sircovid model, we have one compare function that can be (and is) used to fit the same odin model to different data (e.g. age-aggregated vs age-specific); data streams can be switched off by inputting them as NA. Maybe you have an example showing why this wouldn't work with the inline approach, @annecori?

  3. LHS would probably need to support augmented data, which I would expect to be thought of more as part of the model than as parameters. Generally then the LHS wouldn't necessarily just be data.

  4. @annecori wouldn't hyperpriors be thought of as part of the priors rather than the likelihood? We have been discussing implementation of hyperpriors as a properly hierarchical multiregion approach, alongside broader implementation of dependent or joint priors (which probably would be useful in single region modelling too). If you have thoughts on this it would be great to discuss!

richfitz commented 1 year ago

Thanks all for the comments so far. Some comments that might help with the above:

I can understand the desire for flexibility, with multiple compares for a single model, but that runs up against how we have dust set up: you might be able to switch between different compare definition files at compile time, but fundamentally you'd still end up with a finished model that has a single comparison function baked in alongside a single model. It's possible that we could allow some sort of "include this file" statement that would allow slurping in arbitrary odin code for complex models, which you could then use to structure your model. Does Stan etc. offer any support for this? It's a fairly similar situation, I'd imagine, where there are parts you might want to share between different things.

Ed is right that the current approach allows a certain degree of flexibility while baking in a single compare function. What happens is that if you have

observed ~ poisson(lambda)

and you don't have observed data (i.e., it is present as a column in the data that you set, but is NA), then there will be no contribution to the likelihood from that part of the process. This is what we do in sircovid: we switch between two "models" by validating earlier that only one of the two possible data columns is set (all from memory).
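The NA behaviour described above can be sketched in a few lines of Python; the interface is hypothetical, but the rule is the one stated: a missing observation contributes nothing to the likelihood.

```python
import math


def poisson_compare(observed, lam):
    # Missing data (NA) -> no contribution to the likelihood
    if observed is None or (isinstance(observed, float) and math.isnan(observed)):
        return 0.0
    # Poisson log density, i.e. R's dpois(observed, lam, log = TRUE)
    return observed * math.log(lam) - lam - math.lgamma(observed + 1)
```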

It does sound like we should be quite flexible about what goes on the lhs; I'll try to wire it up so as not to constrain things there at all.

lwhittles commented 1 year ago

I'm team inline: it's more in keeping with the current syntax, and it would be just as flexible as the current system in terms of using different compares with the same model, right? If I'm understanding correctly, the compare function lives in a separate script from the odin code for the model, so you could have multiple alternative compares and choose between them at the point of creating the pfilter. E.g. I have my model and I want to fit with both a negbinom and a betabinom compare function, so I write a fitting task in orderly with a parameter that selects the relevant compare function; the script then 'bakes' it in and runs the pmcmc.

I also like the observed ~ poisson(lambda) syntax, as I think this will also help with coding up the equivalent of an rmeasure in pomp (where dmeasure is our compare). E.g. say we have:

config(compare) <- function(cases) {
  cases ~ binomial(infections, p_ascertainment)
}

After we have fitted our model, we can't immediately visually compare the model fit to the data, as we don't have any modelled quantity that corresponds to the observed cases (obviously in this simple example we could easily code up an odin output cases <- infections * p_ascertainment, but that doesn't capture the measurement error we have allowed for, and it isn't always this simple).

In pomp, the rmeasure would generate model_cases <- rbinom(model_infections, p_ascertainment) (a statement which could be derived easily from our compare function syntax), so that for every data stream on the LHS of an expression we produce a model quantity that corresponds visually. I generally code these up myself and call the function observe.
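The dmeasure/rmeasure pairing above can be illustrated with a short sketch: both functions are mechanically derivable from the single statement `cases ~ binomial(infections, p)`. The names and interface are hypothetical, not pomp's or dust's actual API.

```python
# Two views of one observation model (names illustrative)
import numpy as np
from scipy.stats import binom


def dmeasure(cases, infections, p_ascertainment):
    # density: the "compare" (log-likelihood of the observed cases)
    return binom.logpmf(cases, infections, p_ascertainment)


def rmeasure(infections, p_ascertainment, rng):
    # sample: a modelled quantity that visually corresponds to the data
    return rng.binomial(infections, p_ascertainment)
```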

MJomaba commented 1 year ago

A few points to feed into the discussion, some in response to previous comments from others:

richfitz commented 1 year ago

Thanks all - this is super useful. There are obviously (or non-obviously) some internal constraints based on the design, but I'll try to pull together a proof-of-concept in the next month or so.

richfitz commented 1 year ago

Progress on this up here: https://github.com/mrc-ide/odin.dust/pull/131 and https://github.com/mrc-ide/odin/pull/294

I think that this looks pretty good and not that disruptive to the syntax, tbh, and I think with fairly minimal changes it will support most of the above. There's heaps to add (especially things like validation), but I'll expand it and try to capture our existing models with this approach (plus do the compilation needed to get GPU functions too).