probcomp / Gen.jl

A general-purpose probabilistic programming system with programmable inference
https://gen.dev
Apache License 2.0

Make Distributions Generative Functions #259

Open georgematheos opened 4 years ago

georgematheos commented 4 years ago

What if Distribution were a subtype of GenerativeFunction?

This would simplify the implementations of some combinators and DSLs. For example, instead of having separate code paths for @trace-ing a distribution versus a generative function, we could only ever trace generative functions--distributions would simply have the GFI implemented for them! Likewise, what if I want a generative function to sample from a poisson 100 times? If we had Distribution <: GenerativeFunction, we could simply use Map(poisson).
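For concreteness, here is a sketch of how that might look in the dynamic DSL (hypothetical: it assumes the proposed Distribution <: GenerativeFunction change, since Map currently requires its kernel to be a generative function):

plate_of_poissons = Map(poisson)   # only type-checks if poisson is a generative function

@gen function count_model(rates::Vector{Float64})
    # one poisson draw per rate, traced under :counts
    counts = @trace(plate_of_poissons(rates), :counts)
    return counts
end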

We would need a trace type to store a value from the distribution, for instance

struct DistributionTrace{Dist} <: Trace
    args::Tuple
    val
end
get_args(tr::DistributionTrace) = tr.args
get_retval(tr::DistributionTrace) = tr.val
function get_score(tr::DistributionTrace{Dist}) where {Dist}
    # assumes Dist is a singleton type, so Dist() recovers the distribution
    logpdf(Dist(), tr.val, tr.args...)
end
get_gen_fn(tr::DistributionTrace{Dist}) where {Dist} = Dist()

To implement get_choices it seems like we would need a choicemap which has no addresses, just a single value. I think changing the definition of a choicemap to support this is worth considering; I have opened issue #258 to discuss it. Following my proposal in that issue, we could define

get_choices(tr::DistributionTrace) = ValueChoiceMap(tr.val)
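For reference, here is a minimal sketch of what such a choicemap might look like (the type and the addressless get_value/has_value accessors follow the proposal in #258 and are not existing Gen API):

struct ValueChoiceMap{T} <: ChoiceMap
    val::T   # the single value; there are no addresses
end
get_value(choices::ValueChoiceMap) = choices.val
has_value(choices::ValueChoiceMap) = true
Base.isempty(choices::ValueChoiceMap) = false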

I think implementing generate, simulate, update, and regenerate would be pretty straightforward. The internal proposal distribution of a Distribution generative function would be the Distribution itself.
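For example, simulate and generate might look roughly like this (a sketch only, assuming the ValueChoiceMap and addressless get_value from above, and that Distribution <: GenerativeFunction):

function simulate(dist::Dist, args::Tuple) where {Dist <: Distribution}
    val = random(dist, args...)              # sample using the existing random
    DistributionTrace{Dist}(args, val)
end

function generate(dist::Dist, args::Tuple, constraints::ChoiceMap) where {Dist <: Distribution}
    if isempty(constraints)
        (simulate(dist, args), 0.)           # nothing constrained, so the weight is zero
    else
        val = get_value(constraints)         # the single addressless value
        tr = DistributionTrace{Dist}(args, val)
        (tr, logpdf(dist, val, args...))     # weight is the log density of the constrained value
    end
end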

marcoct commented 4 years ago

I agree this would simplify parts of the implementation, and I'm really glad you're bringing this up now.

I considered this before but decided against it at the time because, in some previous probabilistic programming languages in the lab, there was confusion about the distinction between return values of functions (which cannot in general be constrained, and therefore do not have addresses in Gen choice maps) and values that can be constrained. In those earlier languages, the return value of a function had the empty address, and this made thinking about inference constructs more confusing.

The other reason was to preserve the ability to specialize to the special case of a Distribution (i.e. a generative function whose return value is constrainable) with no overhead. Using a subtype as you suggested in https://github.com/probcomp/Gen/issues/258 would probably allow language implementation code to specialize to the Distribution case with little or no overhead (avoiding runtime checks), so that concern might be moot.

I think having uniformity here would be really nice, but I want to make sure we are able to clearly describe the different cases to users. In particular, I want to make sure they don't start trying to constrain the return value of generative functions, or expect these values to appear in choice maps. Choice maps are for constrainable choices only; that is a central part of Gen's design, and it is what allows choice maps to be the abstract representation for random variates passed between proposals and models.

I think we should iterate a bit on how this would be described to users. Here is a start. After describing generative functions, random choices, and addresses (without yet saying what a Distribution actually is), we would say:

"Distributions are a special type of generative function that always have exactly one address in their choice map, and this address is the empty address. The value at this address is always equal to the return value."

We would also need to:

marcoct commented 4 years ago

and distributions just have the GFI implemented for them! Likewise, what if I want to have a generative function sample from a poisson 100 times? If we had Distribution <: GenerativeFunction, then we could simply use Map(poisson).

I think a requirement for this change is that it doesn't introduce substantial performance regressions. To avoid a regression, some language implementations will want to specialize on Distribution (to avoid runtime checks, lookups, etc.), so there might not be as much code reduction as we might expect.
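To illustrate, a DSL implementation might keep the fast path with a dispatch-based specialization along these lines (sample_and_score is a hypothetical internal helper, not Gen API):

# Distributions skip subtrace construction entirely; other generative
# functions go through the generic GFI and allocate a subtrace.
function sample_and_score(d::Distribution, args::Tuple)
    val = random(d, args...)
    (val, logpdf(d, val, args...))
end
function sample_and_score(gf::GenerativeFunction, args::Tuple)
    tr = simulate(gf, args)
    (get_retval(tr), get_score(tr))
end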

marcoct commented 4 years ago

The high-level takeaways from my responses are that (i) we would want to make sure the conceptual model is explained well, and (ii) to avoid performance regressions, there might not be as much code reduction as you are expecting.

But it does seem awesome to be able to use distributions in all the ways that generative functions can be used (e.g. Map), and then let implementers of different components specialize on Distribution if they want. That could potentially expand out-of-the-box coverage for various features, as you mentioned.

And I think this is definitely worth prototyping, benchmarking, and iterating on. There is probably a good way of at least reducing the redundant code without introducing performance regressions.

alex-lew commented 4 years ago

I agree with all this, and will add that:

  1. It could be useful to allow arbitrary generative functions to support constraining the return value, if they want to (e.g., a binomial that exposed each coin flip as a trace address). Alternatively, we could think of this as Distributions with internal trace structure. I guess I just mean: constraining the return value and having a ValueChoiceMap don't seem inherently coupled to me, if we are going to move in the direction of having everything be a generative function.

  2. Re: boxed ValueChoiceMaps -- I agree we'd need to see whether this causes a performance regression. I do think there are benefits to having distributional data encoded in the trace, if we can find a way to do it performantly -- e.g., for implementing algorithms that perform enumeration over discrete variables. But I think performance is the priority for something as far-reaching as the structure of the trace of a choice. (I think Turing may store distributional info in their traces? Their traces are also typed, which I think helps with performance? And my current read on their approach, which could be wrong, is that it only works because they assume a fixed and finite number of top-level addresses -- no if statements determining whether a choice exists, e.g. -- where each top-level address represents an array of similarly-typed choices.)

marcoct commented 4 years ago

(I think Turing may store distributional info in their traces? Their traces are also typed, which I think helps w/ performance? And my current read on their approach, which could be wrong, is that it only works because they assume a fixed and finite number of top-level addresses -- no if statements determining whether a choice exists, e.g. -- where each top-level address represents an array of similarly-typed choices.)

That's interesting. That sounds like it has something in common with Gen's SML, which also generates typed traces.

I think I might be okay with some performance regression in the DML if the SML stayed fast (including the recent performance gains due to @georgematheos; we are not losing those. :) )

marcoct commented 4 years ago

It could be useful to allow arbitrary generative functions to support constraining the return value, if they want to (e.g., a binomial that exposed each coin flip as a trace address). Alternatively, we could think of this as Distributions with internal trace structure. I guess I just mean: constraining the return value and having a ValueChoiceMap don't seem inherently coupled to me, if we are going to move in the direction of having everything be a generative function.

Those two representations of a binomial sound like two different generative functions to me.

I am going to be hesitant to agree to relaxing many of the design choices that make generative functions and traces easy to reason about statically (e.g. what address schema they use), until we have some work on an inference compiler.

alex-lew commented 4 years ago

Those two representations of a binomial sound like two different generative functions to me.

To be clear, I totally agree --- I just mean that you might like to write either, and that in the second case, you might want to be able to constrain the return value (if the GF supports it).

Also agree about being conservative about relaxing design choices.

georgematheos commented 4 years ago

Re formalisms: I agree with @marcoct that clearly articulating what a distribution, choicemap, trace, etc. is is important, especially when making a change like this. I'm not sure we should get rid of the abstract trace type; it is useful for providing default methods and error throwing, and simply wrapping a value in a trace shouldn't have a big performance impact for SML, I think (more on this later). The biggest issue I have with traces is that every trace stores its own args, meaning that sometimes in the DSLs, when an output of one generative function is routed into another, we store the same value twice: once as an arg and once as a return value. I don't know if this is really that big an issue, though, and I don't have any suggestions on it at the moment; my feeling is that, for the most part, traces are pretty useful.

The way I’m thinking about the different objects in Gen is:

I agree with Alex that there are cases where it would be elegant to allow users to constrain the return value of a generative function. (I.e., can I write @trace(..., _) to denote that I want this random decision to have no address and be the return value?) But I think this also introduces some challenges for the formalism, and it might be a separate issue, so I didn't try to account for it above, where I tried to articulate a way to clearly distinguish "updateable" values (those at a leaf node in a choicemap) from non-updateable values (values which are not traced, or which are a consequence of some non-leaf node in a choicemap).

Practically speaking, I'm in favor of having a special DistributionTrace{Dist} type which stores the value sampled from a distribution and the args (and maybe caches the score?). I think it is probably the right call to have distributions continue to support logpdf and random, and to build the GFI methods out of those, though one could argue that it's more elegant to only implement the GFI methods, which after all subsume logpdf and random's functionality.
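For instance, a score-caching variant might look like this (a hypothetical sketch; it trades one eager logpdf call and a stored Float64 for making repeated get_score calls a field lookup):

struct CachedDistributionTrace{Dist} <: Trace
    args::Tuple
    val
    score::Float64   # logpdf(Dist(), val, args...), computed once at construction
end
get_score(tr::CachedDistributionTrace) = tr.score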

Re performance: for sure, making sure performance doesn't suffer is essential; I think having a distinct subtype Distribution, so users can specialize methods on it for performance, is a good idea. I'm also hopeful that even without specializing on Distribution, performance will be pretty good, though we'll need to benchmark to be sure. What I'm hoping will happen is that if we store which distribution is being sampled from in the trace's type parameter, Julia will compile calls to the GFI down into the corresponding calls to logpdf and random.

I.e., if we write

@inline get_score(tr::DistributionTrace{Dist}) where {Dist} = logpdf(Dist(), tr.val, tr.args...)

I'm hoping that when we call get_score, after compilation it will be exactly as fast as if we had just called logpdf in the first place. Since the trace probably needs to store the args, there will be a little overhead to store them when we call simulate/generate; but in the DSLs we already have a line that stashes these args somewhere, so we are probably just moving that line of code rather than adding one. I'm also hoping that wrapping the return value of a distribution in a trace won't add any overhead: ideally Julia will use the trace type information at compile time to figure out the program flow, and after compilation it will essentially just pass around the value, without needing any Trace-type metadata at runtime.

That said, we should certainly run some tests on this; I'm not sure exactly how well optimized Julia is. For instance, if we have an inlined method which returns (tr, 0) every time, and a caller which always discards the 0, is Julia smart enough to know it doesn't even need to produce the 0?
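One quick way to check would be a micro-benchmark like the following (a sketch; BenchmarkTools is an extra dependency, and returns_pair is a stand-in for a GFI method returning (tr, weight)):

using BenchmarkTools

@inline returns_pair(x) = (x, 0)
discards_zero(x) = first(returns_pair(x))

# If the unused 0 is compiled away, these two should benchmark identically;
# @code_llvm discards_zero(1.0) can confirm there is no tuple construction.
x = 1.0
@btime discards_zero($x)
@btime identity($x)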