pyro-ppl / pyro

Deep universal probabilistic programming with Python and PyTorch
http://pyro.ai
Apache License 2.0

Alternative design of variable scoping and naming #502

Closed srush closed 6 years ago

srush commented 7 years ago

Curious why all the params are auto-magically global with a shared namespace. Seems like a very different design choice from the nice, cleanly scoped PyTorch variables.

eb8680 commented 7 years ago

We're definitely planning to add a dynamic scoping mechanism of some sort for both param and sample, it just didn't make it into this release. See #73 for some discussion about this.

srush commented 7 years ago

Sorry, but I don't really get this. The plan seems to be to copy TensorFlow's heavyweight design. PyTorch is so nice because it piggybacks on python's natural scoping rules.

Clarification: It would seem natural to me that params would behave like any other object in PyTorch (variables, params, etc.). You don't access it by a global string in some global state dictionary, but by actually having access to the variable. That way you don't break the standard coding abstractions in Python. This also allows you to easily introduce anonymous terms and parameters and avoids ugly string manipulation for lookups.
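
For concreteness, a rough sketch of the contrast (illustrative names; 0.1-era Pyro/PyTorch APIs):

import torch
import torch.nn as nn
from torch.autograd import Variable
import pyro

# Pyro style: the parameter lives in a shared store, keyed by a string,
# and anyone who knows the string can reach it.
def pyro_style():
    mu = pyro.param("mu", Variable(torch.zeros(1), requires_grad=True))
    return mu

# Plain PyTorch style: the parameter is owned by the module that created it
# and is only reachable through a reference to that module.
class TorchStyle(nn.Module):
    def __init__(self):
        super(TorchStyle, self).__init__()
        self.mu = nn.Parameter(torch.zeros(1))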

eb8680 commented 7 years ago

For models that can be written as nn.Modules, where all parameters can be created before function evaluation, pyro.module will add the correct scope to the parameter names automatically.

Unfortunately, parameters in Pyro don't always work this way; consider e.g. per-datapoint variational parameters that are created on the fly each time a new data point appears, or parameters whose existence depends on a random variable.
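
For example, a rough sketch of that situation (illustrative names, same 0.1-era API as the examples below): a guide that mints a fresh variational parameter for every data point it encounters:

def local_guide(data):
    zs = []
    for i in range(len(data)):
        # this parameter only exists once data point i has been seen,
        # so it cannot be registered up front in an nn.Module constructor
        mu_i = pyro.param("mu_{}".format(i), Variable(torch.zeros(1), requires_grad=True))
        zs.append(pyro.sample("z_{}".format(i), dist.normal, mu_i, torch.ones(1)))
    return zs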

sample statements are also problematic because names are used to identify corresponding random variables in models and guides, which may have arbitrarily different control structure. Lexical scoping for sample names would break this identification.

srush commented 7 years ago

Not sure I understand.

Why should you be able to sample a random variable that you don't have pointer access to? Needing to do that sounds dangerous.

Why does having non-global scoping prevent per-data point parameters? Presumably whatever object is creating the parameters should own them, not some global store.

How come these issues don't come up in pytorch? In theory you have to do many of the same things in standard deep learning, and yet there are no globals.

srush commented 7 years ago

(Sorry, this library is great, and I was thinking of using it for teaching. But I am really worried about this design decision.)

eb8680 commented 7 years ago

(Sorry, this library is great, and I was thinking of using it for teaching. But I am really worried about this design decision.)

No problem! The parameter store is definitely a weak point, and this is great feedback.

Why does having non-global scoping prevent per-data point parameters? Presumably whatever object is creating the parameters should own them, not some global store.

nn.Modules expose references to all of the parameters of their child modules, making it possible to completely separate the definition of losses that only interact with output and optimizers that only interact with parameters and gradients. Since Pyro models can be regular Python functions that call other regular Python functions, you can't in general automatically track parameters through the call stack in the same way, so it seems to me that you end up with something that looks like a parameter store.
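
To spell out the pattern being referenced (plain PyTorch, nothing Pyro-specific; shapes and names are just for illustration):

import torch
import torch.nn as nn
from torch.autograd import Variable

net = nn.Linear(10, 1)
x, y = Variable(torch.randn(5, 10)), Variable(torch.randn(5, 1))

# the optimizer only ever touches parameter references exposed by the module
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

# the loss only ever touches the module's output
loss = ((net(x) - y) ** 2).mean()
loss.backward()
optimizer.step()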

I guess we could have a per-model parameter store instead, though, sort of like a dynamically generated nn.Module wrapper for a function with param calls. That would probably be much cleaner.

srush commented 7 years ago

I guess I get that, but now I'm much more worried about this type of name matching. This feels very scary:

""To be very explicit, if the model contains a random variable z_1

def model(): pyro.sample("z_1", ...) then the guide needs to have a matching sample statement

def guide(): pyro.sample("z_1", ...) The distributions used in the two cases can be different, but the names must line-up 1-to-1."""

If model were a class with a member z_1 = sample, then this could be done correctly instead of asking the user to manage this alignment.

eb8680 commented 7 years ago

I guess I get that, but now I'm much more worried about this type of name matching. This feels very scary.

To compute the Monte Carlo ELBO log p_model(observations, latents) - log p_guide(latents), where latents ~ p_guide, you need to be able to identify sample statements in the model with sample statements in the guide so that you can compute the individual terms log p_model(latent_i | ...) - log p_guide(latent_i | ...).
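
Schematically (illustrative pseudocode, not Pyro's actual implementation), the estimator is assembled site by site, and the site name is what lets a guide sample be scored under the corresponding model term:

def elbo_estimate(model_log_prob_fns, guide_samples, guide_log_probs):
    # guide_samples and guide_log_probs: dicts keyed by site name
    # model_log_prob_fns: dict mapping the same names to functions that
    # score a value under the model's conditional for that site
    # (observation terms log p_model(obs | latents) are added separately)
    elbo = 0.0
    for name, z in guide_samples.items():
        elbo += model_log_prob_fns[name](z) - guide_log_probs[name]
    return elbo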

If model were a class with a member z_1 = sample, then this could be done correctly instead of asking the user to manage this alignment.

In Pyro, we made the choice to completely separate the definitions of model and guide and allow both of them to be callables with arbitrarily different structure and control flow. The user pays for this flexibility by having to manage things themselves, but could implement the safer functionality you describe on top of the more general behavior we provide.

srush commented 7 years ago

Okay, sure. I get it, I just disagree.

I would love to have a PyTorch style mode that looks like this:

class Model:
    def __init__(self):
        self.z_1 = pyro.sample()  # no string global name, can't be accessed without a pointer

    def forward(self):
        ...

class Guide:
    def __init__(self):
        self.q_1 = pyro.sample()

    def forward(self):
        ...

m = Model()
g = Guide()

SVI2(variational=[[m.z_1, g.q_1], ...])

This seems clean to me, has no globals, and could even allow you to reuse guide functions with differently named samples.

dustinvtran commented 7 years ago

This seems clean to me, has no globals, and could even allow you to reuse guide functions with differently named samples.

This is actually the design in Edward (e.g., bayesian_linear_regression.py). This is difficult for dynamic graphs where it's harder to specify the alignment a priori as the sets of random variables can change during runtime.
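
For reference, the pattern in that Edward example looks roughly like this (Edward 1.x; X, y_train, D, and N are assumed to be defined):

import tensorflow as tf
import edward as ed
from edward.models import Normal

# model: random variables are ordinary Python objects held by reference
w = Normal(loc=tf.zeros(D), scale=tf.ones(D))
y = Normal(loc=ed.dot(X, w), scale=tf.ones(N))

# guide (variational approximation): another explicit random variable
qw = Normal(loc=tf.Variable(tf.zeros(D)),
            scale=tf.nn.softplus(tf.Variable(tf.zeros(D))))

# the model/guide alignment is a dict of references, not strings
inference = ed.KLqp({w: qw}, data={y: y_train})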

String-matching is a sensible solution for Pyro. A similar manual alignment could, for example, assume identical class members in Model and Guide. More generally, the alignment could be an arbitrary callable passed into SVI that creates the dictionary/list thing. (Although clunky, I actually think this lattermost design makes sense for puritanical reasons in that the alignment is a feature during inference and not a feature during model/guide specification.)

srush commented 7 years ago

Could one of you give me an actual example that can't be handled with real variable scoping? Just asserting that this is reasonable is not convincing. It seems to me like PL 101 that a programming language should avoid having the user reinvent scoping using global string lookups.

You say that this is a problem with dynamic functions. But PyTorch is a dynamic graph library that never requires the user to do this kind of black magic.

dustinvtran commented 7 years ago

An example of what we're discussing is air.ipynb. FYI I think it's very possible to solve with non-global scoping. Just not sure what the precise solution is and/or how practical it may be.

eb8680 commented 7 years ago

Manual alignment of different names in model and guide is a separate issue from scoping. It's actually possible now to do manual alignment in Pyro; that functionality is already implemented in ReplayPoutine, and could easily be exposed to SVI.

I guess I don't really understand what you're proposing. Scoping for samples is about names, not access; sample statement inputs and outputs are lexically scoped just like function applications.

Consider the following rather contrived example:

def chain(x, t):
    if t == 0:
        return pyro.sample("{}".format(t), dist.normal, torch.zeros(1), torch.ones(1))
    else:
        return chain(pyro.sample("{}".format(t), dist.normal, x, torch.ones(1)), t - 1)

def model(x, t):
    mu = chain(x, t)
    return pyro.sample("y", dist.normal, mu, torch.ones(1))

def mean_field_guide(x, t):
    latents = []
    for i in range(t + 1):
        latents.append(pyro.sample("{}".format(i), dist.normal, pyro.param("m_{}".format(i)), torch.ones(1)))
    return latents

How should the sample names be scoped automatically without changing the model code? How should an inference algorithm align the guide samples with the model samples?

srush commented 7 years ago

@dustinvtran Not sure I understand. I'm arguing that global scoping is not necessary.

@eb8680 I see these issues as very connected. The fact that string alignment is done this way is a symptom of a programming language that is using a problematic internal scoping model.

I made my proposal above. If you want to expose something to an inference algorithm, you should have to have a reference to it, not a string in a global namespace.

Innumerable programming languages thought it was a good idea to have global string references (early PHP, JavaScript, etc.), but that always turns out to be a bad idea, and we think of them as bad languages to work with. It seems early enough that Pyro can avoid this issue.

eb8680 commented 7 years ago

@srush I updated my example to make my point clearer. I really do appreciate your feedback; it's very helpful for thinking through these issues.

If you want to expose something to an inference algorithm, you should have to have a reference to it, not a string in a global namespace.

The sample namespace isn't global, it's local to the model. It seems to me that your proposal just moves the awkwardness from naming into alignment, since that can no longer be done automatically when the model and guide have different scope structures.

srush commented 7 years ago

It seems effectively like a global to me. You access it by constructing its global string name, not by having a reference to it. Compare this to the way PyTorch params respect the natural scoping rules of Python: you cannot magically access them; you need a reference. This is how I expect non-global variables to work.

I think I understand your example. How about this?

class Model:
    def __init__(self):
        self.mus = pyro.LatentList(dist.normal)  # no string global name, can't be accessed without a pointer
        self.x = pyro.Latent(dist.normal)

    def forward(self, t):
        mu = self.mus.next_sample(torch.zeros(1), torch.ones(1))
        for i in range(t):
            mu = self.mus.next_sample(mu, torch.ones(1))
        return self.x.sample(mu, torch.ones(1))

class Guide:
    def __init__(self):
        self.q = pyro.LatentList(dist.normal)  # no string global name, can't be accessed without a pointer

    def forward(self, t):
        return [self.q.next_sample(pyro.param()) for i in range(t)]

m = Model()
g = Guide()

SVI2(variational=[[m.mus, g.q], ...])

Or alternatively you could have a pyro.Plate for when the samples are exchangeable.

martinjankowiak commented 7 years ago

We've talked about doing things of this sort (LatentList etc.), but more so as an additional construct for helping the user deal with names than as a total replacement. My concern would be that some use cases won't be adequately covered by some set of built-in constructs like LatentList and we'd end up with an ever-expanding zoo of such constructs. For example, one of our inference tests uses a conjugate Gaussian model with the structure of an N-layered pyramid. See this figure for N=3:

https://gist.github.com/martinjankowiak/af061c7a00e86e2874a217a0853efb2c

On the left is the model and on the right is one possible factorization of the guide. The way in which I need to sample things in the guide is quite different from the natural ordering in the model. Constructing the guide and aligning it with the model using some generalization of LatentList might be a bit unwieldy and awkward. It might also be tricky to write multiple guides with different factorizations for the same model. Of course, having the user provide explicit names can be tedious too, but at least he or she has the flexibility to support all possible model/guide structures/alignments.

We all agree that we need to think through how naming works more carefully, and this is certainly a good time to be doing it. I think it's definitely worth taking a look at various constructs like LatentList; I'd want to see that they can provide all the flexibility we want before making any breaking changes to how names work in Pyro. In any case, thanks for your suggestion: these are exactly the kinds of conversations we were looking forward to having once we opened up the repo :)

srush commented 7 years ago

Thanks Martin. That's helpful to know.

Let me reiterate my point once more though, because I feel it is being lost. My argument is not for a particular style of handling locals (I made up this LatentList thing on my phone), but against globals as the default option that is the "documented way to do it". If there are corner-case models that really require string-matching-based alignment, then have a LatentDict object that is owned by that part of the model and can use whatever strings it wants. This gives the ability to "support all possible model/guide structures/alignments", since you can then do something similar in the Guide. This has all the functionality you want but does not require what, to me, feels like late-90s PHP query string construction.

fritzo commented 7 years ago

I think we all agree it would help to have more concrete examples 😃 To help explore syntax, I hacked together a pyro.anon module on an anonymous-sites branch (never intended to be merged). This has a single example examples/anonymous.py that actually runs in Pyro. @srush does this look any closer to your ideal syntax? If so, would you be able to provide a more complex example? If not and I've missed your point, perhaps we should all meet over video conference to help clarify.

srush commented 7 years ago

Yeah! This is perfect, exactly what I imagined. Maybe it can be a bit less verbose, but it checks all my boxes. Also it is hierarchical, so you can easily nest groupings of latent variables without string building. (It's kind of a state monad in Python.)

My one quibble is that it still matches the guide variables by name, but otherwise it's way better.

eb8680 commented 7 years ago

@fritzo that's pretty neat. It might make sense to implement a LatentObject layer on top of nn.Module, so that those sorts of models can inherit Module parameter handling directly.

@srush we'll definitely consider including something like that as we iterate on naming and scoping, though I see it existing alongside, rather than replacing, the current syntax, much like torch.nn exists alongside the raw autograd API.

srush commented 7 years ago

So just imagining a bit how you could make this even more friendly.

1) Maybe this is too much, but you could have a @pyro decorator that wraps this class thing and returns the object with all the contained variables. That way it makes it easy to build up models as little blocks.
2) Since these functions are lazy, I am imagining that the explicit alignment in SVI could be done by passing in a lambda.
3) Since models are stacked, you can pretty-print all the variables based on how they are grouped in the code.

@pyro
def TimeSeries(model, obvs):
    model.sigma = anon.param(Variable(torch.ones(1), requires_grad=True))
    model.steps = []
    for obv in obvs:
        prev = model.steps[-1].y if model.steps else Variable(torch.zeros(1))
        model.steps.append(TimeStep()(obv, prev, model.sigma))

@pyro
def TimeStep(model, x, prev, sigma):
    model.y = anon.sample(dist.normal, prev, sigma)
    model.x = anon.observe(dist.normal, x, model.y, sigma)

@pyro
def Guide(model, obvs):
    model.mus = [anon.param(Variable(torch.zeros(1), requires_grad=True)) for _ in range(len(obvs))]
    model.sigmas = [anon.param(Variable(torch.ones(1), requires_grad=True)) for _ in range(len(obvs))]
    model.ys = [anon.sample(dist.normal, model.mus[i], model.sigmas[i]) for i in range(len(obvs))]

def main(args):
    model = TimeSeries()
    guide = Guide()
    optim = Adam({"lr": 0.01})
    inference = SVI(model, guide, optim, loss="ELBO",
                    align=lambda m, g: [([s.y for s in m.steps], g.ys)])
    data = Variable(torch.Tensor([0, 1, 1, 0, 1, 2]))
    for step in range(args.num_epochs):
        loss = inference.step(data)
        if step % 100 == 0:
            print('{}\t{:0.5g}'.format(step, loss))
    print('Parameters:')

    # Will automatically collect all the contained variable names
    for name in model.get_all_variable_names():
        print(name)

srush commented 7 years ago

haha @eb8680 I think you have to make a choice :) torch.nn is not a great analogy, as it is entirely built on top of torch.autograd. Building a nice, well-scoped library using anonymous names in a global store with scary functions like clear_param_store() will be dangerous.

dustinvtran commented 7 years ago

Since these functions are lazy. I am imagining that the explicit alignment in SVI could be done by passing in a lambda.

I agree. To reiterate my last comment:

More generally, the alignment could be an arbitrary callable passed into SVI that creates the dictionary/list thing. (Although clunky, I actually think this lattermost design makes sense for puritanical reasons in that the alignment is a feature during inference and not a feature during model/guide specification.)

srush commented 7 years ago

Oh yes @dustinvtran I agree. I was mostly responding to a change to @fritzo 's sample code. My main mission here is to kill the global variable store. I see these string alignment sub-issues as symptoms of the global variable store.

fritzo commented 7 years ago

@srush I like your decorator idea. I've updated examples/anonymous.py to use an @anon.function decorator similar to your @pyro. The resulting modeling language looks good but has lots of sharp edges.

srush commented 7 years ago

Nice! This is looking great, let me play around a bit and try some of the examples.

srush commented 7 years ago

Oh wait, I see, @fritzo. So the LatentList thing is only there to ensure that all of the anon calls have names that match their variables? So it automatically calls them "x[0]", "x[1]", etc.?

What's the argument for why that is necessary? Now that you have actual objects, why not just give them a random address?

pyro.sample(uuid.uuid4(), ...)

Am I missing something?

fritzo commented 7 years ago

Could we call a subfunction and assign its returned latent to a member?

Hmm, assignment from a return value doesn't generally work yet (since the returned latent must be attached to the top-level model before assignment). But it does work to let subfunctions mutate parts of the latent:

@anon.function     # Decorator is only used for top-level models.
def model(latent, size):
    latent.params.mu = Variable(torch.zeros(1))
    latent.params.sigma = anon.param(Variable(torch.ones(1)))
    latent.samples = []
    my_subfunction(latent.params, latent.samples, size)

def my_subfunction(params, samples, size):
    for i in range(size):
        samples.append(anon.sample(dist.normal, params.mu, params.sigma))

fritzo commented 7 years ago

Am I missing something?

One crucial behavior you might be unaware of is that pyro.sample() statements execute differently depending on their name. In the model, we need to know the name of a sample statement before executing it, so that if it was sampled as part of the guide we can use that value instead.

srush commented 7 years ago

oh! I see, that makes a lot of sense. That also explains a lot of the decisions I didn't understand.

jpchen commented 7 years ago

Yep, refer to the ELBO implementation to see how the 'alignment' occurs under the hood. In _get_traces() we get the trace of both the model and the guide, and then we use the value sampled from the guide (in the guide trace) to calculate the ELBO. This matching happens via unique sample names.
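
Roughly, the mechanics look like this (a sketch using the present-day poutine API; model, guide, and data are assumed to be defined, and the exact method names have shifted across versions):

from pyro import poutine

# record every sample statement the guide executes, keyed by site name
guide_trace = poutine.trace(guide).get_trace(data)

# rerun the model, forcing each sample site to reuse the value recorded
# in the guide trace at the site with the same name
model_trace = poutine.trace(
    poutine.replay(model, trace=guide_trace)).get_trace(data)

# single-sample ELBO estimate from the two aligned traces
elbo = model_trace.log_prob_sum() - guide_trace.log_prob_sum()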

srush commented 7 years ago

Hmm, so given that this swap happens through some cps/poutine trick that is invisible to the user and I assume happening at a global level, maybe in the anon version latent should also be the object that controls sampling?

latent.y = latent.sample(dist.normal, prev, sigma)

That way the user can grok that latent may have different semantics depending on its caller, versus anon.sample which looks harmless.

(Or alternatively through a ctx variable also passed in, but maybe that is too much.)

fritzo commented 7 years ago

Here's a similar notation that works: examples/attribute.py. Feels very protobuffy.

srush commented 7 years ago

I like it.

The object setting is a little too weird for me though :) Maybe go even more protobuffy?

latent.mus = LatentList()
or
latent.mus.list()

...
latent.mus.add().param(Variable(torch.zeros(1), requires_grad=True))

I think dict could work out of the box as well.

latent.qs = LatentDict()
or
latent.qs.dict()

...
latent.qs["new"].param()

And actually at this point you could remove the @anon.function decorator and just have an anon.SVI that calls model(Latent(), ...).

srush commented 7 years ago

@fritzo Okay, I went through and ported all the documentation to use anon. (I mostly did the intro and SVI; a lot of the other documentation is not runnable.)

https://github.com/harvardnlp/pyro/tree/anonymous-sites/tutorial_ette/source

Mostly it worked great. Here are the remaining issues I saw:

alpha_q = torch.exp(pyro.param(".log_alpha_q")).data.numpy()[0]

Similarly, .marginal seems to return a dictionary, but it should instead return a (maybe read-only) Latent object.

conditioned_scale = pyro.condition(
    scale, data={"measurement": Variable(torch.Tensor([8.5])), "xs_1": Variable(torch.Tensor([1]))}) 

I think the anon version should have these string names not be globals, but localized to the function (scale). Also, we should be able to pass in arrays and reference nested modules like "submodel.mu".

Maybe for complicated data, there should be a data_fn=obs where obs is a function that takes latent and fills in variables it wants.

srush commented 7 years ago

Okay, fixed most of these issues in my branch:

https://github.com/harvardnlp/pyro/blob/anonymous-sites/pyro/anon.py

https://github.com/harvardnlp/pyro/blob/anonymous-sites/tutorial_ette/source/intro_part_i.ipynb https://github.com/harvardnlp/pyro/blob/anonymous-sites/tutorial_ette/source/intro_part_ii.ipynb https://github.com/harvardnlp/pyro/blob/anonymous-sites/tutorial_ette/source/svi_part_i.ipynb

There is some more to do, but I think this is already way cleaner, maintains scope, makes the semantics clearer, supports real arrays and nesting, and uses less code.

fritzo commented 7 years ago

Trying to find a compromise that we could merge soon: #540

eb8680 commented 7 years ago

@srush thanks for your ongoing feedback. This issue keeps growing in scope, and it would be really helpful if you could open new issues about things like inadequate documentation for iarange and random_module.

As I said, you've made some great points about the parameter store, whose design was largely borrowed from webPPL, and changes like the ones you've sketched seem like clear wins. We're also more than willing to include an anon-like module ( #540 ), provided it's properly fleshed out.

However, let me try to give some more background on the way sample names work to explain why a similar solution is not an unambiguous improvement there:

The biggest issue I saw was with the condition function. That function is cool, and useful. But it needs to give the observations as a dict with strings...

There's not really any way around this, and condition is not just a cool feature that can be ignored or removed; it or something like it is essential for implementing advanced inference algorithms like EP or particle Gibbs with arbitrary guides.

There is a similar issue in the sites argument of some of the inference algorithms. Not sure how to get around a string argument there

You've also identified the same fundamental problem with your model-guide alignment solution: in general, aligning execution traces or observations with an arbitrary Pyro program can be as hard as executing the program, since you need to refer to variables before they exist. As @fritzo said:

In the model, we need to know the name of a sample statement before executing it, so that if it was sampled as part of the guide we can use that value instead.

A corollary to this fact is that in Pyro's current incarnation, sample names don't really correspond conceptually to Python variable names and scopes, and there's no particular reason that they should unless the model and trace being aligned already share the same structure; think of them instead as tags or database queries attached to stochastic function applications (where the function and its return value could both be anonymous Python variables) that solve the problem of alignment by construction.

This design choice is rooted in both practical considerations and historical context. A particularly common workflow in Pyro is to write a single model, then write multiple guides specific to that model with different structure and one or more condition/do queries and combine these ingredients with SVI. We don't particularly like string manipulation either, but Pyro's sample names mean that, rather than effectively re-expressing the model's control flow for every guide and condition query in the form of a new alignment, we only need to write it once. Other naming/addressing schemes, like the one in anon, can only move complexity around to different places, and we feel that the current approach works well in the use cases we've explored in depth.

Historically, previous universal PPLs like Church, webPPL, Venture, and Anglican handled alignment by automatically generating name strings based on call stack position (which could be done easily in Pyro with Python's advanced runtime introspection facilities, e.g. a memoized version of traceback.extract_stack()). Guides were handled by threading the guide code through the model code (as in webPPL) or by simply not allowing arbitrary guides. condition is also a new construct for a universal PPL; previously observations were handled by threading data through models.
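
A crude sketch of what stack-based addressing might look like (illustrative only; a real scheme also needs counters so that repeated calls from the same line, e.g. in a loop, stay distinct):

import traceback

def stack_address():
    # build an address from the (function name, line number) pairs on the
    # current call stack, dropping this helper's own frame; recursive calls
    # get distinct addresses because the stack itself differs
    frames = traceback.extract_stack()[:-1]
    return "/".join("{}:{}".format(f.name, f.lineno) for f in frames)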

Building up names manually through Pyro programs instead of data and guides is one natural generalization of these previous approaches. It cleanly separates the specification of model, data, and inference, and provides a clear point of contact for more advanced techniques (like code transformations, static analysis, alignment specifications like your proposal that completely decouple model and guide names, or the current implementation of anon (!)) that reduce the burden of manual naming when possible.

As @dustinvtran said, Edward's behavior is closer to what you have in mind. This is precisely because it is possible to refer to nodes in a TensorFlow graph before you know their values, which also limits Edward's expressivity. Indeed the alignment mechanism you added to your anon example, which uses simultaneous references to model and guide variables with no strings, is only possible if anon is placed under similar constraints, such that it would effectively be identical to Edward (albeit with lexical scope rather than TensorFlow's dynamic scope).

Ultimately, we only see Pyro as one hypothesis in the space of possible language designs. It may well turn out to be a bad hypothesis - for example, Edward has clearly demonstrated the utility of static graphs as a probabilistic modelling language, and maybe in practice the benefits outweigh relatively mild restrictions on expressivity. That said, our development process was driven by the interaction of our design principles, properties of previous PPLs, and our tests and anchor examples, and we found that what we have strikes a good balance. Larger changes to the syntax and internals of Pyro represent different hypotheses, and should generally be justified on the basis of similar evidence.

srush commented 7 years ago

Wait, sorry, I'm confused by this response. I think the last anon solution I posted and wrote examples for captures everything you would like, i.e. keeping around arbitrary condition statements (with strings) and the alignment of guides, while also providing a syntax that doesn't require the user to reason about a global store or global CPS. Which part of that is a failure? In particular, you state that this "is only possible if anon is placed under similar constraints, such that it would effectively be identical to Edward", but I haven't seen an example yet that really needs what you are referring to.

In particular, I went through the exercise of rewriting this file. I think it is reasonably clean looking, and handles the condition case you mention:

https://github.com/harvardnlp/pyro/blob/anonymous-sites/tutorial_ette/source/intro_part_ii.ipynb

As I have said above, I have no problem with string-lazy access. My problem is with default global variables as the recommended way you are advising users to use the library. I really like the dynamic design of Pyro, and I agree with your design goals. I just feel that these things do not need to be handled globally but can be built up explicitly in the functions by mirroring the semantics of Python.

eb8680 commented 7 years ago

Which part of that is a failure?

There's no failure there! I'm just trying to point out that anon is a different point on the Pareto frontier induced by the fact that models, guides, and data can be arbitrarily different.

srush commented 7 years ago

I meant in the sense of "what flexibility is it giving up?" Different point implies there are negatives, but I don't see how it is more constrained. So far we are able to do all the tutorial examples using it while removing global variables (which I think we can agree are a negative).