pyro-ppl / pyro

Deep universal probabilistic programming with Python and PyTorch
http://pyro.ai
Apache License 2.0

[feature request] Implementing TrustVI #2768

Open velezbeltran opened 3 years ago

velezbeltran commented 3 years ago

Hey!

I am the guy from the post here: https://forum.pyro.ai/t/possible-help/2169/4. We are planning on implementing TrustVI and we made a document with a couple of questions, goals, a rough idea of what to do, and a brief summary of the algorithm at the end. It would be quite helpful if we could get some feedback on our rough goals.

https://docs.google.com/document/d/1wll-yJCKc5RUku1etscg6m_EGU9rDQZrnhGBdQDNwHs/edit?usp=sharing

I apologize if there are any mistakes in the document. We aren't yet familiar with all of the inner workings of Pyro (though we are trying and hopefully will be at some point!)

fritzo commented 3 years ago

Hi @nb2838, glad to hear you're implementing this in Pyro! I can answer questions directly on your doc if you turn on commenting. Here are some answers:

The algorithm in the paper uses the reparametrization trick. From our understanding, Pyro falls back to a REINFORCE estimator otherwise. Where is the reparametrization trick implemented?

Pyro uses the reparametrization trick for any distribution that implements an .rsample() method (a reparametrized sampler). Pyro's ELBO implementation is kind of distributed all around the codebase. From the top down, Trace_ELBO._differentiable_loss_particle() computes the elbo, assuming get_importance_trace() has already traced the model and guide and computed log_prob and score_parts and stored them in a trace object. You can take a look at each site in a trace via e.g. model_trace.nodes["x"]. The logic deciding whether each site can be reparametrized happens in Distribution.score_parts; this is a little subtle because Pyro allows partially-reparametrized distributions. I would recommend starting by assuming all distributions are fully reparametrized.
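Here's a minimal (untested) sketch of poking at those internals; the toy model and guide are made up, and the two trace methods called are roughly what get_importance_trace() does for you:

```py
import torch
import pyro
import pyro.distributions as dist
from pyro import poutine

def model():
    x = pyro.sample("x", dist.Normal(0.0, 1.0))
    pyro.sample("obs", dist.Normal(x, 1.0), obs=torch.tensor(0.5))

def guide():
    loc = pyro.param("loc", torch.tensor(0.0))
    scale = pyro.param("scale", torch.tensor(1.0),
                       constraint=dist.constraints.positive)
    pyro.sample("x", dist.Normal(loc, scale))

# Trace the guide, then replay the model against the guide's sampled values.
guide_trace = poutine.trace(guide).get_trace()
model_trace = poutine.trace(poutine.replay(model, guide_trace)).get_trace()
model_trace.compute_log_prob()
guide_trace.compute_score_parts()

site = guide_trace.nodes["x"]
print(site["fn"].has_rsample)   # True -> pathwise (reparametrized) gradients
print(site["score_parts"])      # (log_prob, score_function, entropy_term)
```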

How should we deal with computing the Hessian?

I'd start by following AutoLaplaceApproximation, which uses hessian(-,-) from pyro.ops.hessian. The loss input to hessian(-,-) can be the output of TraceEnum_ELBO.differentiable_loss(), which should in principle be infinitely differentiable (I'm not sure whether Trace_ELBO.differentiable_loss() is twice differentiable, so I'd recommend sticking with TraceEnum_ELBO). The param inputs to hessian(-,-) should be unconstrained parameters. You can get these via

```py
params = [
    site["value"].unconstrained()
    for site in trace.nodes.values()
    if site["type"] == "param"
]
```
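Putting those pieces together might look roughly like this (a sketch, untested; it assumes a no-argument model/guide like the toy pair above and pulls parameters from the param store instead of a trace):

```py
import pyro
from pyro.infer import TraceEnum_ELBO
from pyro.ops.hessian import hessian

elbo = TraceEnum_ELBO(max_plate_nesting=0)
loss = elbo.differentiable_loss(model, guide)

# Unconstrained parameters, taken from the global param store.
params = [pyro.param(name).unconstrained()
          for name in pyro.get_param_store().keys()]
H = hessian(loss, params)  # dense Hessian over the flattened parameters
```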

The paper suggests using Krylov methods, but Newton's method in Pyro seems to do this computation directly.

Maybe Krylov methods will work better in high dimensions? :shrug: We wrote Newton for huge batches of 1-3 dimensional Hessians (effectively a block diagonal Hessian). If you're interested in Hessians of highly correlated models, maybe @fehiepsi or @martinjankowiak has an idea?

Should we stick to 1-3 dimensions as Newton’s method does?

Definitely in the first PR, just stick to analytic Hessian in low dimensions. Then you can add high-dimensional Krylov methods in a follow-up PR.
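For reference, a rough sketch of the batched Newton utility mentioned above, adapted from the pyro.ops.newton docs (the quadratic loss is just a stand-in; check the module for the exact contract):

```py
import torch
from pyro.ops.newton import newton_step

# 1000 independent 2-dimensional problems, solved with trust-region Newton steps.
target = torch.randn(1000, 2)
x = torch.zeros(1000, 2)
for _ in range(10):
    x = x.detach().requires_grad_()
    loss = ((x - target) ** 2).sum()
    x, cov = newton_step(loss, x, trust_radius=1.0)
```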

Generally:

martinjankowiak commented 3 years ago
velezbeltran commented 3 years ago

Thank you for being so detailed with your answers. It helps a lot! We will start working on it then.

As per your suggestions, we will start by implementing TrustVI for a model similar to the one we plan to use in the tutorial, and we will write the general method later. Also, we will not focus on Krylov subspace methods for the moment. As soon as we have made progress we will make a PR 😁

velezbeltran commented 3 years ago

Hello! I am sorry for the delays. Unfortunately, most of the people working on this project had to take on additional coursework at the end of April, so we are restarting work right now. We can work on it for the next few months, but since I saw that you have the help wanted label, I understand if this is too late.

fritzo commented 3 years ago

@nb2838 no hurry, feel free to take as long as you like. I've tagged this help wanted just to signal that "no core Pyro contributor plans to work on this issue", so by all means feel free to take ownership. Here: I'll assign it to you 😄

velezbeltran commented 3 years ago

Hello,

So we are running into a small issue that we don't exactly know how to solve. The algorithm works by repeatedly evaluating each proposed step in parameter space by checking the difference between the old ELBO and the new ELBO.

[image: expression for the ELBO difference, from the TrustVI paper]

The issue is that, according to the algorithm, both ELBOs should use the same samples from the base distribution that was used for the reparametrization of the guide (denoted e_{ki} above). Perhaps this picture clarifies the notation and what I am referring to:

[image: excerpt from the TrustVI paper clarifying the notation]
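In symbols, what I mean is roughly the following (my notation, not exactly the paper's), where g_lambda is the guide's reparametrization transform:

```latex
\Delta\widehat{\mathcal{L}} \approx \frac{1}{N} \sum_{i=1}^{N}
  \big[ f(\lambda_{\text{new}}, \epsilon_{ki}) - f(\lambda_{\text{old}}, \epsilon_{ki}) \big],
\qquad
f(\lambda, \epsilon) = \log p\big(x, g_\lambda(\epsilon)\big)
                     - \log q_\lambda\big(g_\lambda(\epsilon)\big),
```

with the same base-distribution draws \epsilon_{ki} used at both parameter values.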

Do you know how to do this in Pyro, or could you give us some tips on where/how to implement it?

Thank you!

fritzo commented 3 years ago

@nb2838 IIUC you need the underlying randomness to agree between two elbos, but the parameters to differ? I believe you can do this in Pyro for a subset of distributions: those that can be reparametrized to be explicitly differentiable functions of explicitly parameter-free Pyro distributions. (Counterexamples include Gamma, Beta, and Dirichlet, whose "reparametrized sample" .rsample() methods use implicit reparametrized gradients).

Here's a possible design approach:

1. Allow users to explicitly reparametrize their model via [poutine.reparam](https://docs.pyro.ai/en/dev/infer.reparam.html). Possibly even provide a `full_reparam()` wrapper that fully reparametrizes a model as parametrized transforms of parameter-free noise.

2. In your inference algorithm, use [poutine.trace](https://docs.pyro.ai/en/dev/poutine.html#pyro.poutine.handlers.trace) and [poutine.condition()](https://docs.pyro.ai/en/dev/poutine.html#pyro.poutine.handlers.condition) to record the noise and replay it under two parameter values. (I'm unsure how you'll want to thread through the parameters):

   ```py
   # Construct a pair of guide traces with the same underlying randomness.
   old_params = {...}
   new_params = {...}
   old_guide_trace = poutine.trace(guide).get_trace(old_params)
   fixed_noise = {
       name: site["value"]
       for name, site in old_guide_trace.iter_stochastic_nodes()
   }
   for name, value in fixed_noise.items():
       if value.requires_grad:
           raise ValueError(f"Site {repr(name)} is not fully reparametrized!")
   new_guide_trace = poutine.trace(
       poutine.condition(guide, fixed_noise)
   ).get_trace(new_params)
   ```

   Now it's easy to use `poutine.trace` and `poutine.replay` to construct new elbos

   ```py
   old_model_trace = poutine.trace(
       poutine.replay(model, old_guide_trace)
   ).get_trace(old_params)
   old_loss, old_surrogate_loss = Trace_ELBO()._differentiable_loss_particle(
       old_model_trace, old_guide_trace
   )
   ```

   and similarly to get the `new_surrogate_loss`.
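Continuing that sketch (same hypothetical variable names as above), the "new" side is symmetric, and the difference of the two losses is the shared-noise ELBO change you asked about:

```py
new_model_trace = poutine.trace(
    poutine.replay(model, new_guide_trace)
).get_trace(new_params)
new_loss, new_surrogate_loss = Trace_ELBO()._differentiable_loss_particle(
    new_model_trace, new_guide_trace
)
# The losses are negative ELBOs, so the differentiable estimate of the
# ELBO improvement under shared noise is old minus new.
elbo_improvement = old_surrogate_loss - new_surrogate_loss
```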
velezbeltran commented 3 years ago

Thank you @fritzo. What you mention is exactly what I was looking for. I am not familiar with the poutine.reparam effect handler so I'll take a look at that and use it for that particular step of the algorithm.

fritzo commented 3 years ago

@nb2838 Sounds good. I think it's safe to first implement inference with a check (the ValueError above) that assumes models have been manually reparametrized, and then in follow-up PRs we can work on automatic reparametrization strategies.
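For concreteness, here's an untested sketch of what a "manually reparametrized" guide might look like, using LocScaleReparam as one possible strategy (the toy guide is made up; see the reparam docs for the handler details and auxiliary site naming):

```py
import torch
import pyro
import pyro.distributions as dist
from pyro import poutine
from pyro.infer.reparam import LocScaleReparam

def guide():
    loc = pyro.param("loc", torch.tensor(0.0))
    scale = pyro.param("scale", torch.tensor(1.0),
                       constraint=dist.constraints.positive)
    pyro.sample("x", dist.Normal(loc, scale))

# Fully decentered: the auxiliary noise site is standard Normal and carries no
# dependence on (loc, scale), so its value can be fixed and replayed.
reparam_guide = poutine.reparam(guide, config={"x": LocScaleReparam(centered=0.0)})
trace = poutine.trace(reparam_guide).get_trace()
for name, site in trace.iter_stochastic_nodes():
    print(name, site["value"].requires_grad)
```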

velezbeltran commented 3 years ago

Wanted to provide an update on this. We have a rough implementation of the algorithm for a simple model. However, it relies on the dogleg method to solve the trust-region subproblem, and unfortunately this doesn't work well. Currently, we are trying to implement a different minimizer for the trust-region subproblem.
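For reference, the subproblem we mean is minimizing the local quadratic model m(p) = g·p + ½ p·H·p subject to ||p|| <= Delta. A generic, untested sketch of the classic dogleg step (plain PyTorch, not Pyro API, and it assumes a positive-definite Hessian) looks like this:

```py
import torch

def dogleg_step(grad, hess, trust_radius):
    """One dogleg step: approximately minimize g.p + 0.5 * p.H.p
    subject to ||p|| <= trust_radius, assuming H is positive definite."""
    # Unconstrained Newton step.
    p_newton = -torch.linalg.solve(hess, grad)
    if p_newton.norm() <= trust_radius:
        return p_newton
    # Cauchy point: minimizer of the model along the steepest-descent direction.
    p_cauchy = -(grad @ grad) / (grad @ hess @ grad) * grad
    if p_cauchy.norm() >= trust_radius:
        return -(trust_radius / grad.norm()) * grad
    # Walk from the Cauchy point toward the Newton point until the
    # trust-region boundary is hit.
    d = p_newton - p_cauchy
    a = d @ d
    b = 2.0 * (p_cauchy @ d)
    c = p_cauchy @ p_cauchy - trust_radius ** 2
    tau = (-b + torch.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return p_cauchy + tau * d
```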

velezbeltran commented 3 years ago

Hello, we will need to drop this project, so please feel free to assign it to someone else. I apologize for the inconvenience.