axch opened this issue 9 years ago
The first thing that comes to mind is demonstrating that Venture can express common non-Bayesian ML algorithms. For example, demonstrating a regularized logistic regression using IRLS / Newton's method, or a neural network using backpropagation, would be nice examples. For some of these, second-order information (Hessian) would be required. If this direction is taken, perhaps we should consider provisioning for the function to return more general side-information beyond just the gradient (e.g. Hessians).
The automatic differentiation system we have now cannot do nested AD, so Hessians would be an enormous amount of work. This issue is about packaging functionality we essentially already have in a convenient form rather than adding new functionality, but you're right that Hessians would be useful in principle.
HS-Venture should be able to compute Hessians.
The difficulty with this is that detach and regen both mutate the trace and the subproblem object, in such a way that the subproblem object is not reusable after a detach-regen cycle. The slow solution is probably to just re-select the subproblem every time; that may be an acceptable proof of concept.
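For illustration, a minimal sketch of that slow workaround in the inference language, using only the `select`/`detach`/`regen` actions described later in this thread; the scope and block names `foo`/`bar` are placeholders:

```
; Re-select before every cycle, since the subproblem object is not
; reusable after detach/regen have mutated it.
(do (s1 <- (select foo bar))
    ((w1, db1) <- (detach s1))
    (w2 <- (regen s1))
    ; second cycle: select a fresh subproblem rather than reusing s1
    (s2 <- (select foo bar))
    ((w3, db2) <- (detach s2))
    (w4 <- (regen s2)))
```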
So actually, it looks like the `GradientOfRegen` object in `hmc.py` already implements the desired pattern in this case (actually regen-select-detach, on an already-detached state). Seems like it wouldn't be too difficult to expose that in the inference language as a proof of concept, separately from resolving the scaffold mess.
Yes. There is a choice: do we expose (select-detach-regen) or (select-detach), (regen-select-detach), (regen)? I no longer recall quite why I did it that way for hmc; perhaps it had to do with wanting the rhoDB from the first detach, and with wanting to do the last regen with fresh randomness. I suppose we could even expose both.
It wants to be regen-select-detach, because detach returns the gradient which is used to update the values to use for the next regen.
But select-detach-regen is much simpler to explain, and feels more natural externally. That means we should probably have both.
What would be the signature of select-detach-regen? regen-select-detach is pretty clearly (subproblem, values) -> (weight, gradient of weight), leaving the trace in the same state before and after. For select-detach-regen to fulfill the same use case, it would need to be a higher-order function that accepts an update function that produces new values to propose to regen given the current values and gradient. I guess that's not bad, although I would think it's less likely to work as well with external optimization packages that expect to be handed a function that they can call. (The other one would be a select-detach-regen that doesn't accept any values, so regen just proposes from the prior; not sure what there is to "package up" in that case though.)
The trouble with regen-select-detach :: (subproblem, values) -> (weight, gradient of weight) is that it (currently) mutates the input subproblem. But the package can fix that, e.g. by returning an updated one, or mutating it back.
The select-detach-regen variant I was thinking of would accept values that do not depend on the weight or the gradient: (subproblem-spec, values) -> (weight_ratio, gradient of weight at values). (Or maybe both weights rather than just the ratio).
It is becoming clearer that we should just make all of these packages, and see which ones lead to convenient uses.
But the gradient returned by detach is the gradient at the old values, not the new ones, unless select-detach-regen is actually select-detach-regen-detach-regen.
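To make the two packagings under discussion concrete, here is a rough sketch; neither action exists yet, so the names `regen_select_detach` and `select_detach_regen` and their calling conventions are assumptions:

```
; Variant 1: (subproblem, values) -> (weight, gradient of weight),
; leaving the trace in the same state before and after.
(do (subproblem <- (select foo bar))
    (values <- (get_current_values subproblem))
    ((weight, gradient) <- (regen_select_detach subproblem values)))

; Variant 2: higher-order select_detach_regen; the caller supplies a step
; function from (current values, gradient) to the values to regen with.
(do (subproblem <- (select foo bar))
    (weight <- (select_detach_regen subproblem
                 (lambda (values gradient) ...))))  ; e.g. a small gradient step
```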
Interesting artifact I found in Marco's code: if the variables of interest are top-level, one could use `force` to set them and `log_joint_at` to evaluate the posterior density. Doesn't give the gradient, though, and doesn't expose the "fixed randomness" trick.
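A sketch of that trick, assuming the usual directive syntax and that `log_joint_at` takes a scope and a block (the exact argument conventions here are my guess):

```
[assume mu (normal 0 1)]
[observe (normal mu 1) 3]
; Set the top-level variable directly, then score the trace at that value.
[force mu 2.5]
[infer (log_joint_at default all)]  ; posterior density up to a constant; no gradient
```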
For the record, here are my project notes from the initial implementation of the inference SPs `select`, `detach`, `regen`, etc. This can be viewed as a sort of design document for the status quo.
Regen/Detach as inference SPs
Subgoals:
Initial limitations:
Would be nice if:
Imperative mh looks like this if the default regen is from the prior:

```
(do (subproblem <- (select foo bar)) ; really, select by availability of log densities
    ((rho_weight, rho_db) <- (detach subproblem))
    (xi_weight <- (regen subproblem))
    (if (< (uniform ...) ...)
        ...
        (do (detach subproblem)
            (restore subproblem rho_db))))
```
With functional-underneath traces, we can have this:

```
(do (subproblem <- (select foo bar))
    (original <- (copy_trace))
    (rho_weight <- (detach subproblem))
    (xi_weight <- (regen subproblem))
    (if (< (uniform ...) ...)
        ...
        (set_trace original)))
```
A candidate for custom proposals:

```
(do (subproblem <- (select foo bar))
    (current_x <- ...)
    ((rho_weight, rho_db) <- (detach subproblem))
    ; somewhere need credit for the reverse proposal, rather than the prior
    (new_x <- (normal current_x 1))
    (correction <- ...)
    ; set x to new_x
    (xi_weight <- (regen subproblem))
    (if (< (uniform ...) ...)
        ...
        (do (detach subproblem)
            (regen/restore subproblem rho_db))))
```
The inference programming language currently has functions that expose the underlying trace API more or less the way it's implemented:
- `select :: scope -> block -> Action subproblem`
- `detach :: subproblem -> Action (weight, rhoDB)` (leaves a torus and returns the likelihood of the old state)
- `regen :: subproblem -> Action weight` (fills a torus and returns the likelihood of the new state)
- `restore :: subproblem -> rhoDB -> Action weight`
- `detach_for_proposal :: subproblem -> Action (weight, rhoDB)` (the weight is the full local posterior)
- `regen_with_proposal :: subproblem -> [value] -> Action weight` (inserts given values into the principal nodes and returns the value of the full local posterior of the result)
- `get_current_values :: subproblem -> Action [value]`
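For concreteness, here is roughly what custom-proposal mh looks like when built from just these actions, following the "candidate for custom proposals" pattern from the notes above. The scope/block `foo`/`bar`, the assumption of a single principal node, and the acceptance bookkeeping are placeholders, and any asymmetric-proposal correction is omitted:

```
(do (subproblem <- (select foo bar))
    (old_values <- (get_current_values subproblem))
    ((rho_weight, rho_db) <- (detach_for_proposal subproblem))
    ; symmetric random-walk proposal around the principal node's old value
    (new_x <- (normal (first old_values) 1))
    (xi_weight <- (regen_with_proposal subproblem (list new_x)))
    (if (< (log (uniform_continuous 0 1)) (- xi_weight rho_weight))
        (pass)                              ; accept: keep the proposed state
        (do (detach subproblem)             ; reject: roll back to the old values
            (restore subproblem rho_db))))
```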
This is enough to write things like mh with custom proposal distributions in the inference programming language; the tutorial even has an example. The next level, though, is to be able to write and debug gradient methods, or even better, use third-party gradient methods. Specific things that would be good to have:
- `detach_for_proposal` that returns the gradient of the weight wrt the values of the principal nodes (this should be easy) (name it `detach_for_proposal_with_gradient`?)
- A `FixedRandomness` object to expose the "fixed randomness" trick (but see #138).
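And a sketch of the gradient case the first item is asking for; `detach_for_proposal_with_gradient` does not exist, so its name, its returning the gradient alongside the weight and rhoDB, the single principal node, and the fixed step size are all assumptions:

```
(do (subproblem <- (select foo bar))
    (old_values <- (get_current_values subproblem))
    ; hypothetical: like detach_for_proposal, but also returning the gradient
    ; of the weight with respect to the principal nodes' current values
    ((rho_weight, rho_db, gradient) <- (detach_for_proposal_with_gradient subproblem))
    ; one deterministic ascent step on a single principal node
    (new_x <- (return (+ (first old_values) (* 0.01 (first gradient)))))
    (xi_weight <- (regen_with_proposal subproblem (list new_x))))
```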