pyro-ppl / pyro

Deep universal probabilistic programming with Python and PyTorch
http://pyro.ai
Apache License 2.0
8.55k stars 987 forks source link

Adjust ELBO so that baselines don't have to output scaled costs #555

Open null-a opened 6 years ago

null-a commented 6 years ago

533 removed a superfluous scaling factor from the LR term of the ELBO estimator by undoing the scaling of log_q. I propose that we achieve this by undoing the scaling of downstream_cost instead, like this.

Doing so does not change gradient estimates, but it does rescale the values we're asking baseline nets to output.

For example, here is some debug output from an inference run of the AIR example. (With a batch size of 2.) It shows the output of the baseline network at various points (output: ...) as well as the values we'd like it to output (target: ...).

``` target: [0.0, 8691153.0] output: [0.0, 33.2] i=100, epochs=0.00, elapsed=0.00, elbo=-271.15 target: [0.0, 0.0] output: [0.0, 0.0] i=200, epochs=0.01, elapsed=0.01, elbo=-150.59 target: [7280237.5, 0.0] output: [88.6, 0.0] i=300, epochs=0.01, elapsed=0.01, elbo=-477.05 target: [0.0, 21047686.0] output: [0.0, 114.2] i=400, epochs=0.01, elapsed=0.02, elbo=-704.59 target: [0.0, 21167150.0] output: [0.0, 140.4] i=500, epochs=0.02, elapsed=0.02, elbo=-627.35 target: [16254318.0, 0.0] output: [164.9, 0.0] i=600, epochs=0.02, elapsed=0.02, elbo=-624.58 target: [0.0, 12181961.0] output: [0.0, 189.0] i=700, epochs=0.02, elapsed=0.03, elbo=-557.61 target: [0.0, 0.0] output: [0.0, 0.0] i=800, epochs=0.03, elapsed=0.03, elbo=-586.60 target: [21169508.0, 0.0] output: [235.7, 0.0] i=900, epochs=0.03, elapsed=0.04, elbo=-448.33 target: [0.0, 21032276.0] output: [0.0, 260.1] i=1000, epochs=0.03, elapsed=0.04, elbo=-704.91 ```

As you can see, the target values are large, and even after 1K steps the net's outputs aren't of the expected order.

After making the suggested change however, things look more sensible:

``` target: [0.0, 229.6] output: [0.0, 32.9] i=100, epochs=0.00, elapsed=0.00, elbo=-144.65 target: [0.0, 610.7] output: [0.0, 61.8] i=200, epochs=0.01, elapsed=0.01, elbo=-688.94 target: [290.0, 587.9] output: [86.5, 86.5] i=300, epochs=0.01, elapsed=0.01, elbo=-516.98 target: [590.7, 0.0] output: [109.8, 0.0] i=400, epochs=0.01, elapsed=0.02, elbo=-591.06 target: [532.3, 566.2] output: [132.9, 132.9] i=500, epochs=0.02, elapsed=0.02, elbo=-679.37 target: [0.0, 516.3] output: [0.0, 153.9] i=600, epochs=0.02, elapsed=0.02, elbo=-492.29 target: [0.0, 405.7] output: [0.0, 175.5] i=700, epochs=0.02, elapsed=0.03, elbo=-556.48 target: [237.1, 491.8] output: [194.0, 194.0] i=800, epochs=0.03, elapsed=0.03, elbo=-554.03 target: [0.0, 487.9] output: [0.0, 212.6] i=900, epochs=0.03, elapsed=0.04, elbo=-705.26 target: [470.2, 470.2] output: [233.2, 233.2] i=1000, epochs=0.03, elapsed=0.04, elbo=-697.53 ```

This change also has the effect of making the scale of the baselines targets invariant to the batch size. I'm not sure there are any immediate practical consequences of this, but this seems like a desirable property.

One objection to this change might be that we ought to simply increase the step size instead. This would work, but I suspect that using large targets and correspondingly large step sizes will make it more difficult to compare baseline step sizes with published results. So for that reason, I think making this change is a better approach.

(EDIT: updated to use Github's <details> tags)

martinjankowiak commented 6 years ago

@null-a is 'target' the downstream cost for one sample or is it something else?

null-a commented 6 years ago

Yeah, it's downstream_cost. (ETA: for a single sample)

martinjankowiak commented 6 years ago

one thing that's kind of unnatural about that is that in the general case the downstream cost may contain nested subsamples and so multiple scale factors.

maybe instead of being (possibly over-)clever maybe it'd make sense to have SVI take a normalizer argument that is used to scale the entire loss?

fritzo commented 6 years ago

@null-a This change makes sense to me, in that your new baselines are independent of batch size. (so e.g. you could in theory do distributed training on two different machines with different bactch sizes but with shared baselines).

  1. Do you plan to submit a separate PR for this? If so, it would be good to keep this code in sync with Trace_ELBO.
  2. Do you think this should be included in our 0.1.2 patch release?
ngoodman commented 6 years ago

the baselines targets invariant to the batch size

this seems very desireable to me: i can easily imagine fancy training setups where we change the batch size during training, and it'd be sad if this gave us unuseable baselines.... (this makes me wonder about other cases where we might want reuseable baselines, like missing observations.)

null-a commented 6 years ago

one thing that's kind of unnatural

@martinjankowiak Hmm, I don't think I follow, sorry. Could you maybe expand a bit on where this would be problematic?

Do you plan to submit a separate PR for this?

@fritzo: If we go ahead, yeah, I'll make a separate PR. I also agree about Trace_ELBO.

Do you think this should be included in our 0.1.2 patch release?

I don't think so. It doesn't seem urgent, and of course I'd like to understand @martinjankowiak's reservation before submitting a PR.

martinjankowiak commented 6 years ago

well, i mean something like this.

surrogate_elbo = elbo + sum_z log q(z) x DSC(z)

where the z are the non-rep latents and DSC is the downstream cost at node z.

if there's nested subsampling so that there are various scale factors they will appear in the terms in elbo as well as in the terms in DSC(z). so unless the scale factors appear everywhere and are the same everywhere (like in a model with only local latent variables and no globals and only one layer of subsampling like AIR), i don't think your change would actually yield invariance of baseline under batch size. in general the baseline is integrating over information at different scales (there are multiple batch sizes) and so i don't think it can be made invariant like that (unless the baseline is built with a particular architecture to reflect the multi-scale structure of the sum of costs it's tracking?).

that said, i think your change is reasonable. i just suspect we might end up needing to revisit it down the line. thus my suggestion: if the automagic thing isn't going to be quite right in all cases, maybe the user should just decide how he or she wants things scaled.

null-a commented 6 years ago

@martinjankowiak: Thanks for that, I appreciate you taking the time to explain! However, I'm afraid I still don't see where this goes wrong. :(

Here's a rough attempt to explain why I suspected it would work. Maybe seeing this will make it clear where I'm going wrong!

Imagine this model/guide structure:

def model():
    with pyro.iarange("", 2, subsample_size=b_1) as i1:
        pyro.sample()
        with pyro.iarange("", 2, subsample_size=b_2) as i2:
            pyro.sample()

Here's a picture of the dependencies we track, with * showing sample statements and o showing iarange.

      o
     / \
    /   \
  (*)    *     costs at this level scaled by m_1 = 2/b_1
   |     |
   o     o
  / \   / \
 *   * *   *   costs at this level scaled by m_2 = m_1 * 2/b_2

We scale log_pdfs and therefore 'costs' depending on the level of nesting. The scaling factors at each level are on the picture.

For simplicity imagine that the local cost (log(q/p)) at each node is c.

As an example, focus on the (*) node, then its downstream_cost (the thing the baseline is currently trying to output) is:

downstream_cost for (*) = m_1 * c + m_2 * b_2 * c

The b_2 comes from the fact the we incur a c for each sub-sample we take.

If we divide this by the scaling factor at (*)s level:

scaled_cost = (m_1 * c + m_2 * b_2 * c) / m_1
            = (m_1 * c + m_1 * (2 / b_2) * b_2 * c) / m_1
            = c + 2*c

Then we get something which isn't dependent on the batch size.

I think something similar applies as you extend out with more levels of nesting. e.g. At the next level out you'd encounter b_2 * b_3 choices, but this term disappears from downstream_cost because the scaling factor at that level is m_1 * 2/b_2 * 2/b_3. Any idea what I'm missing?

martinjankowiak commented 6 years ago

@null-a do you think this works for models and guides with different structures?

null-a commented 6 years ago

do you think this works for models and guides with different structures?

@martinjankowiak Ah, good question! (I hadn't considered that.) My initial impression is that if it's the case that sub-sampling requires a particular random choice to show up in the same (potentially nested sequence of) iarange in both the model and the guide, then it will probably work out. Since this similarity means that the picture I drew earlier still makes sense, and ~the same~ a similar argument goes through even if there is arbitrary internal structure in each *.

(Clearly this would need more thought before we considered implementing this.)

fritzo commented 6 years ago

@martinjankowiak I totally forgot about this issue; IIRC we wanted to get it in before 0.2. I assume it's now too late, correct?

martinjankowiak commented 6 years ago

well, i'm not sure if there is a clear automagical solution here. or whether a tailored solution is really necessary. for example, if you're worried about big numbers one thing you can do is wrap everything with a scale decorator

@poutine.scale(scale=0.001) or @poutine.scale(scale=1/batch_size)