probcomp / Gen.jl

A general-purpose probabilistic programming system with programmable inference
https://gen.dev
Apache License 2.0

Make Distributions Generative Functions #259

Open georgematheos opened 4 years ago

georgematheos commented 4 years ago

What if Distribution were a subtype of GenerativeFunction?

This would simplify the implementations of some combinators and DSLs. For example, instead of having separate code paths for @trace-ing a distribution versus a generative function, we could only ever trace generative functions--distributions would simply have the GFI implemented for them! Likewise, what if I want a generative function to sample from a poisson 100 times? If we had Distribution <: GenerativeFunction, we could simply use Map(poisson).
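For concreteness, here is a sketch of how that might look in the dynamic DSL (hypothetical: it assumes the proposed Distribution <: GenerativeFunction change, since Map currently requires its kernel to be a generative function):

plate_of_poissons = Map(poisson)   # only type-checks if poisson is a generative function

@gen function count_model(rates::Vector{Float64})
    # one poisson draw per rate, traced under :counts
    counts = @trace(plate_of_poissons(rates), :counts)
    return counts
end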

We would need a trace type to store a value from the distribution, for instance

struct DistributionTrace{Dist} <: Trace
    args::Tuple
    val
end
get_args(tr::DistributionTrace) = tr.args
get_retval(tr::DistributionTrace) = tr.val
function get_score(tr::DistributionTrace{Dist}) where {Dist}
    # assumes Dist is a singleton type, so Dist() recovers the distribution
    logpdf(Dist(), tr.val, tr.args...)
end
get_gen_fn(tr::DistributionTrace{Dist}) where {Dist} = Dist()

To implement get_choices it seems like we would need a choicemap which has no addresses, just a single value. I think changing the definition of a choicemap to support this is worth considering; I have opened issue #258 to discuss it. Following my proposal in that issue, we could define

get_choices(tr::DistributionTrace) = ValueChoiceMap(tr.val)
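For reference, here is a minimal sketch of what such a choicemap might look like (the type and the addressless get_value/has_value accessors follow the proposal in #258 and are not existing Gen API):

struct ValueChoiceMap{T} <: ChoiceMap
    val::T   # the single value; there are no addresses
end
get_value(choices::ValueChoiceMap) = choices.val
has_value(choices::ValueChoiceMap) = true
Base.isempty(choices::ValueChoiceMap) = false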

I think implementing generate, simulate, update, and regenerate would be pretty straightforward. The internal proposal distribution of a Distribution generative function would be the Distribution itself.
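For example, simulate and generate might look roughly like this (a sketch only, assuming the ValueChoiceMap and addressless get_value from above, and that Distribution <: GenerativeFunction):

function simulate(dist::Dist, args::Tuple) where {Dist <: Distribution}
    val = random(dist, args...)              # sample using the existing random
    DistributionTrace{Dist}(args, val)
end

function generate(dist::Dist, args::Tuple, constraints::ChoiceMap) where {Dist <: Distribution}
    if isempty(constraints)
        (simulate(dist, args), 0.)           # nothing constrained, so the weight is zero
    else
        val = get_value(constraints)         # the single addressless value
        tr = DistributionTrace{Dist}(args, val)
        (tr, logpdf(dist, val, args...))     # weight is the log density of the constrained value
    end
end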

marcoct commented 4 years ago

I agree this would simplify parts of the implementation, and I'm really glad you're bringing this up now.

I considered this before but decided against it at the time because, in some previous probabilistic programming languages in the lab, there was confusion about the distinction between return values of functions (which cannot in general be constrained, and therefore do not have addresses in Gen choice maps) and values that can be constrained. In those earlier languages, the return value of a function had the empty address, and this made thinking about inference constructs more confusing.

The other reason was to preserve the ability to specialize to the special case of a Distribution (i.e. a generative function whose return value is constrainable) with no overhead. Using a subtype as you suggested in https://github.com/probcomp/Gen/issues/258 would probably allow language implementation code to specialize to the Distribution case with little or no overhead (avoiding runtime checks), so that concern might be moot.

I think having uniformity here would be really nice, but I want to make sure we are able to clearly describe the different cases to users. In particular, I want to make sure they don't start trying to constrain the return value of generative functions, or expect these values to appear in choice maps. Choice maps are for constrainable choices only; that is a central part of Gen's design, and it is what allows choice maps to be the abstract representation for random variates passed between proposals and models.

I think we should iterate a bit on how this would be described to users. Here is a start. After describing generative functions, random choices, and addresses (without yet saying what a Distribution actually is), we would say:

"Distributions are a special type of generative function that always have exactly one address in their choice map, and this address is the empty address. The value at this address is always equal to the return value."

We would also need to:

marcoct commented 4 years ago

and distributions just have the GFI implemented for them! Likewise, what if I want to have a generative function sample from a poisson 100 times? If we had Distribution <: GenerativeFunction, then we could simply use Map(poisson).

I think a requirement for this change is that it doesn't introduce substantial performance regressions. To avoid a regression, some language implementations will want to specialize on Distribution (to avoid runtime checks, lookups, etc.), so there might not be as much code reduction as we might expect.
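To illustrate, a DSL implementation might keep the fast path with a dispatch-based specialization along these lines (sample_and_score is a hypothetical internal helper, not Gen API):

# Distributions skip subtrace construction entirely; other generative
# functions go through the generic GFI and allocate a subtrace.
function sample_and_score(d::Distribution, args::Tuple)
    val = random(d, args...)
    (val, logpdf(d, val, args...))
end
function sample_and_score(gf::GenerativeFunction, args::Tuple)
    tr = simulate(gf, args)
    (get_retval(tr), get_score(tr))
end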

marcoct commented 4 years ago

The high-level takeaways from my responses are that (i) we would want to make sure the conceptual model is explained well, and (ii) to avoid performance regressions, there might not be as much code reduction as you are expecting.

But it does seem awesome to be able to use distributions in all the ways that generative functions can be used (e.g. Map), and then let implementers of different components specialize on Distribution if they want. That could potentially expand out-of-the-box coverage for various features, as you mentioned.

And I think this is definitely worth prototyping, benchmarking, and iterating on. There is probably a good way of at least reducing the redundant code without introducing performance regressions.

alex-lew commented 4 years ago

I agree with all this, and will add that:

  1. It could be useful to allow arbitrary generative functions to support constraining the return value, if they want to (e.g., a binomial that exposed each coin flip as a trace address). Alternatively, we could think of this as Distributions with internal trace structure. I guess I just mean: constraining the return value and having a ValueChoiceMap don't seem inherently coupled to me, if we are going to move in the direction of having everything be a generative function.

  2. Re: boxed ValueChoiceMaps -- I agree we'd need to see whether this causes a performance regression. I do think there are benefits to having distributional data encoded in the trace, if we can find a way to do it performantly -- e.g., for implementing algorithms that perform enumeration over discrete variables. But I think performance is the priority for something as far-reaching as the structure of the trace of a choice. (I think Turing may store distributional info in their traces? Their traces are also typed, which I think helps with performance? And my current read on their approach, which could be wrong, is that it only works because they assume a fixed and finite number of top-level addresses -- no if statements determining whether a choice exists, e.g. -- where each top-level address represents an array of similarly-typed choices.)

marcoct commented 4 years ago

(I think Turing may store distributional info in their traces? Their traces are also typed, which I think helps w/ performance? And my current read on their approach, which could be wrong, is that it only works because they assume a fixed and finite number of top-level addresses -- no if statements determining whether a choice exists, e.g. -- where each top-level address represents an array of similarly-typed choices.)

That's interesting. That sounds like it has something in common with Gen's SML, which also generates typed traces.

I think I might be okay with some performance regression in the DML if the SML stayed fast (including the recent performance gains due to @georgematheos; we are not losing those. :) )

marcoct commented 4 years ago

It could be useful to allow arbitrary generative functions to support constraining the return value, if they want to (e.g., a binomial that exposed each coin flip as a trace address). Alternatively, we could think of this as Distributions with internal trace structure. I guess I just mean: constraining the return value and having a ValueChoiceMap don't seem inherently coupled to me, if we are going to move in the direction of having everything be a generative function.

Those two representations of a binomial sound like two different generative functions to me.

I am going to be hesitant to agree to relaxing many of the design choices that make generative functions and traces easy to reason about statically (e.g. what address schema they use), until we have some work on an inference compiler.

alex-lew commented 4 years ago

Those two representations of a binomial sound like two different generative functions to me.

To be clear, I totally agree --- I just mean that you might like to write either, and that in the second case, you might want to be able to constrain the return value (if the GF supports it).

Also agree about being conservative about relaxing design choices.

georgematheos commented 4 years ago

Re formalisms: I agree with @marcoct that clearly articulating what a distribution, choicemap, trace, etc. is is important, especially when making a change like this. I'm not sure we should get rid of the abstract trace type; it is useful for providing default methods and error throwing, and simply wrapping a value in a trace shouldn't have a big performance impact for SML, I think (more on this later). The biggest issue I have with traces is that every trace stores its own args, meaning that sometimes in the DSLs, when an output of one generative function is routed into another, we store the same value twice: once as an arg and once as a return value. I don't know if this is really that big an issue, though, and I don't have any suggestions on it at the moment; my feeling is that, for the most part, traces are pretty useful.

The way I’m thinking about the different objects in Gen is:

I agree with Alex that there are cases where it would be elegant to allow users to constrain the return value of a generative function. (I.e., can I write @trace(..., _) to denote that I want this random decision to have no address and be the return value?) But I think this also introduces some challenges for the formalism, and it might be a separate issue, so I didn't try to account for it above, where I tried to articulate a way to clearly distinguish "updateable" values (those at a leaf node in a choicemap) from non-updateable values (values which are not traced, or which are a consequence of some non-leaf node in a choicemap).

Practically speaking, I'm in favor of having a special DistributionTrace{Dist} type which stores the value sampled from a distribution and the args (and maybe caches the score?). I think it is probably the right call to have distributions continue to support logpdf and random, and to build the GFI methods out of those, though one could argue that it's more elegant to only implement the GFI methods, which after all subsume logpdf and random's functionality.
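For instance, a score-caching variant might look like this (a hypothetical sketch; it trades one eager logpdf call and a stored Float64 for making repeated get_score calls a field lookup):

struct CachedDistributionTrace{Dist} <: Trace
    args::Tuple
    val
    score::Float64   # logpdf(Dist(), val, args...), computed once at construction
end
get_score(tr::CachedDistributionTrace) = tr.score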

Re performance: for sure, making sure performance doesn't suffer is essential; I think having a distinct subtype Distribution, so users can specialize methods on it for performance, is a good idea. I'm also hopeful that even without specializing on Distribution, performance will be pretty good, though we'll need to benchmark to be sure. What I'm hoping will happen is that if we store which distribution is being sampled from in the trace's type parameter, Julia will compile calls to the GFI down into the corresponding calls to logpdf and random.

I.e., if we write

@inline get_score(tr::DistributionTrace{Dist}) where {Dist} = logpdf(Dist(), tr.val, tr.args...)

I'm hoping that when we call get_score, after compilation it will be exactly as fast as if we had just called logpdf in the first place. Since the trace probably needs to store the args, there will be a little overhead to store them when we call simulate/generate; but in the DSLs we already have a line that stashes these args somewhere, so we are probably just moving that line of code rather than adding one. I'm also hoping that wrapping the return value of a distribution in a trace won't add any overhead: ideally Julia will use the trace type information at compile time to figure out the program flow, and after compilation it will essentially just pass around the value, without needing any Trace-type metadata at runtime.

That said, we should certainly run some tests on this; I'm not sure exactly how well optimized Julia is. For instance, if we have an inlined method which returns (tr, 0) every time, and a caller which always discards the 0, is Julia smart enough to know it doesn't even need to produce the 0?
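One quick way to check would be a micro-benchmark like the following (a sketch; BenchmarkTools is an extra dependency, and returns_pair is a stand-in for a GFI method returning (tr, weight)):

using BenchmarkTools

@inline returns_pair(x) = (x, 0)
discards_zero(x) = first(returns_pair(x))

# If the unused 0 is compiled away, these two should benchmark identically;
# @code_llvm discards_zero(1.0) can confirm there is no tuple construction.
x = 1.0
@btime discards_zero($x)
@btime identity($x)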