igarnier / prbnmcn-dagger

A library for probabilistic programming
https://igarnier.github.io/prbnmcn-dagger/

kernel distributions #4

Closed nilsbecker closed 1 year ago

nilsbecker commented 1 year ago

https://igarnier.github.io/prbnmcn-dagger/prbnmcn-dagger/Dagger/Dist/index.html#val-kernel

this line is not entirely clear to me: when specifying a transition kernel for sampling, i would assume that one then gets a markov chain that converges to the stationary distribution of that kernel. for that, knowing the log-transition density would not be necessary. why is it nevertheless required? for instance, the brownian kernel adds a gaussian transition density which would not be needed to just sample brownian paths (with uniform stationary distribution). one could simply accept every gaussian proposal.

i tried to read the source, but the cps style makes things a bit opaque to me. here it looks like there is a metropolis-hastings rejection step which uses the transition log density (but the additional summing of total scores confuses me).

as stated above, naively i would think this is not needed: couldn't one implement metropolis-hastings simply by calling Lmh_inference.score?

but since rejection appears here, does this mean that we can specify a transition density that differs from that of the kernel used for the proposal step? could we thereby implement rejection sampling of steps already at the level of specifying a transition kernel?

disclaimer: confused; possibly nonsense.

[edit]: it seems the line i referenced actually does the mh rejection for the full run, including calls to score. so then my question is: why does the log-transition density enter here? actually, the log-likelihood also enters when sampling from a Stateless distribution, which surprises me too. i was thinking: if we already sample from a prior, we do not additionally need to introduce a weight for the samples, since that would be double weighting. clearly i am missing something basic.

nilsbecker commented 1 year ago

ok, so i reread the lightweight mcmc paper and most of my confusion has cleared up.

i now realize that it's all about mcmc sampling in trace space, so one needs the likelihood of a path in order to propose an update for an entire execution trace, including all downstream effects of any given modified random choice. this line appears to run the downstream continuation (incidentally, why is Obj.magic necessary?)

in the paper, a proposal kernel appears in the acceptance criterion, but only as a forward/backward ratio. this makes sense given that the kernel is just for proposing, not for scoring; any imbalance between forward and backward transitions is therefore cancelled out. so in practice, for symmetric kernels, one would not need the kernel transition pdf after all.
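that cancellation can be made concrete in a small sketch (my own notation, not dagger's API): a generic MH acceptance step in log space, where the hypothetical `log_kernel x y` is the log density of proposing `y` from `x`.

```ocaml
(* Illustrative sketch, not dagger's API: MH log-acceptance with the
   forward/backward proposal correction. For a symmetric kernel,
   log_kernel x y = log_kernel y x and the correction term vanishes,
   leaving the plain target ratio. *)
let log_acceptance ~log_target ~log_kernel ~current ~proposal =
  let forward = log_kernel current proposal (* q(proposal | current) *)
  and backward = log_kernel proposal current (* q(current | proposal) *) in
  min 0.0 (log_target proposal -. log_target current +. backward -. forward)
```

with a gaussian random-walk kernel the correction is identically zero, which matches the observation that a symmetric kernel's density drops out of the ratio.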

a different use for a kernel density would be to actually generate samples (not proposals). then i can see that the kernel density is needed to score the execution trace when proposing an MCMC update of traces. so i kind of get why a kernel also needs a density.

nilsbecker commented 1 year ago

PS. what i still do not understand in the lightweight paper is the appearance of a sort of path-entropy term log |D|, and the accounting for fresh and discarded log-likelihoods ll_{fresh,stale} in the MH acceptance criterion; neither is explained in any way.

igarnier commented 1 year ago

i now realize that it's all about mcmc sampling in trace space, so one needs the likelihood of a path in order to propose an update for an entire execution trace, including all downstream effects of any given modified random choice. this line appears to run the downstream continuation

Yes, exactly!

(incidentally, why is Obj.magic necessary?)

I must have had trouble devising a way to propagate the global type of the computation (type a in stream_samples) in the other functions. The return type of continuations is existentially quantified away, so we have to use Obj.magic to recover it (it's safe but ugly). I can probably do better (I just managed to get rid of them in smc_inference).
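A toy illustration of the underlying problem (my own example, not dagger's actual code): once a continuation is stored under an existential, its argument type is forgotten by the type checker, and resuming it at the original type needs an unsafe cast.

```ocaml
(* Toy example, not dagger's code: the constructor [Cell] existentially
   quantifies away the continuation's argument type. [Obj.magic] reasserts
   that type at resume time; this is safe only when the caller's type
   claim is actually true, which the compiler cannot check. *)
type ecell = Cell : ('a -> unit) -> ecell

let stash : ecell option ref = ref None

(* Store a continuation, losing its argument type. *)
let save (k : 'a -> unit) = stash := Some (Cell k)

(* Resume it at a type the caller promises is the right one. *)
let resume (type a) (x : a) =
  match !stash with
  | Some (Cell k) -> (Obj.magic k : a -> unit) x
  | None -> ()
```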

PS. what i still do not understand in the lightweight paper is the appearance of a sort of path-entropy term log |D|, and the accounting for fresh and discarded log-likelihoods ll_{fresh,stale} in the MH acceptance criterion; neither is explained in any way.

I don't know. I found this paper hard to understand as well. Other references that were useful to me are:

nilsbecker commented 1 year ago

thanks, that's helpful. apparently there are some correctness issues with the wingate et al algorithm -- what is the algorithm that ended up in Lmh_inference?

[edit] wow, after reading o. kiselyov's "problems" notes, it's clear that correctness of sampling is not a given in the literature on probabilistic programming. clearly, there are some serious flaws in the original algorithm. unfortunately, while he criticizes the acceptance ratio in wingate et al, he does not fully explain why he thinks it is accidentally correct. it appears that at least the problem with 'conditioning within conditional branches' is solved in a more recent version of hakaru10.

igarnier commented 1 year ago

what is the algorithm that ended up in Lmh_inference

I don't have a specific paper to point to from which I would have copied the implementation; rather, I re-derived it from the abstract descriptions in the first 3 references I gave. The last paper I discovered after implementing the two LMH inference algorithms, but the approach it proposes seems pretty close to the one I came up with. As an additional reference point, the incremental LMH in dagger should be close to Hakaru10.

Regarding correctness, I claim that the implementations in dagger behave correctly on the example given in A Provably Correct Sampler for Probabilistic Programs, so at least that particular bug should not be present :) (there's a test in the test suite). Ideally, I should test dagger against Hakaru10 on the examples provided in Oleg's papers and check that the posteriors match.

edit: I added a test for the mixture model in the first page of Oleg's PPS2017 paper and all samplers in dagger produce the correct result.
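One hypothetical shape such a cross-check could take (my sketch, not the actual test suite): compare two empirical posteriors over a shared finite support by total-variation distance and fail above a tolerance.

```ocaml
(* Hypothetical posterior cross-check, not dagger's test suite: total
   variation distance between two probability vectors over the same
   finite support, i.e. half the L1 distance. *)
let tv_distance p q =
  List.fold_left2 (fun acc a b -> acc +. abs_float (a -. b)) 0.0 p q /. 2.0

(* Accept if the two samplers agree within [tol]; the tolerance would
   have to be chosen from the Monte-Carlo error of the sample sizes. *)
let posteriors_match ?(tol = 0.02) p q = tv_distance p q < tol
```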

nilsbecker commented 1 year ago

PS. the last reference was very helpful. a thorough textbook-like discussion.

i now get the logic behind the acceptance ratios: by splitting the acceptance ratio into the proposal probability of a monte-carlo move and the acceptance probability, things become clearer. the move begins with selecting one of the ERPs along the trace with uniform probability; that's where the log(trace length) contribution comes from. then, moves are proposed for this ERP and the dependent ones; if the proposals are chosen by always sampling fresh values from the priors, the proposal contribution cancels with the corresponding contribution to the prior probability of that trace. the acceptance ratio then only retains the scored weights, which correspond to the log-likelihood.
finally, when randomness is reused by keeping ERP values but reweighting them due to changed parameters of their prior distributions, this adds another contribution, but only for the reused ERPs. much happier now
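under that reading, the single-site acceptance ratio of the lightweight paper can be written down directly (my notation, not dagger's API; `n` is the number of ERPs in a trace, `ll` the accumulated score):

```ocaml
(* Sketch of a Wingate-style single-site LMH log-acceptance, in my own
   notation, not dagger's API: the uniform choice of site contributes
   log n_old - log n_new, fresh resampling from the priors cancels
   against the prior terms of the trace, and the stale/fresh terms
   account for reused randomness whose prior parameters changed. *)
let log_accept ~n_old ~n_new ~ll_old ~ll_new ~ll_stale ~ll_fresh =
  ll_new -. ll_old
  +. log (float_of_int n_old) -. log (float_of_int n_new)
  +. ll_stale -. ll_fresh
```

when the trace length is unchanged and no randomness is reused, this collapses to the plain likelihood ratio, as described above.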

but it also becomes clear that for realistic problems it is mostly going to be difficult to generate good enough proposals by single-site MH. the likelihood often depends very sensitively on early random choices, so a naive move of an early ERP is almost certain to be rejected. my mental image for this is a semiflexible polymer simulation in 2D. consider a chain of N links of equal length, with the first link clamped at the origin. the ERPs are the angles between adjacent links, each with a prior that favors the chain being straight. say we're interested in statistics of chain conformations constrained to form a loop: the Nth link is supposed to return to the origin. we build this into the model with a likelihood that strongly favors the last link landing in a small neighborhood of the origin. now any single-angle move will pivot the whole rest of the chain away from the origin, resulting in unusably poor sampling with naive single-site moves.
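the pivot effect is easy to see numerically (illustrative sketch, no dagger involved): compute the endpoint of a chain of unit links from its joint angles, and note that changing the first angle rotates the entire downstream chain.

```ocaml
(* Endpoint of a 2D chain of unit-length links; [angles] are the turning
   angles at each joint (the ERPs in the thought experiment). Perturbing
   an early angle displaces the endpoint by roughly the length of the
   remaining chain, which is why single-site moves on early ERPs are
   almost always rejected under a tight loop-closure likelihood. *)
let endpoint angles =
  let _, x, y =
    List.fold_left
      (fun (theta, x, y) a ->
        let theta = theta +. a in
        (theta, x +. cos theta, y +. sin theta))
      (0.0, 0.0, 0.0) angles
  in
  (x, y)
```

for a straight 4-link chain the endpoint is at distance 4 from the origin; rotating only the first angle by pi/2 moves the endpoint by the full chain length.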

of course, there is a whole scientific field that has spent the last 50 years or so developing clever moves to deal with this. we should not expect a general-purpose tool to solve the general problem of difficult parameter spaces, but it should give the user good options for implementing better moves.

i found the last chapters in the book promising in this regard. it would be cool to be able to generate good collective moves automatically via backpropagation using algorithmic differentiation. i've tried to use AD in owl but was not convinced by its ergonomics: essentially, all type safety goes away because all AD-compatible numeric types are wrapped into one opaque AD.t. but clearly this combination (prob. programming + backpropagation) is where one needs to go (and apparently Stan is already there). maybe https://github.com/lukstafi/ocannl will be better?