rethinkpriorities / squigglepy

Squiggle programming language for intuitive probabilistic estimation features in Python
MIT License

weird x*x sample behavior #47

Closed · edoarad closed this 1 year ago

edoarad commented 1 year ago

import squigglepy as sq
import numpy as np

x = sq.norm(0,1)
a = sq.sample(x*x, n=1000)
np.histogram(a, bins=10)
peterhurford commented 1 year ago

@edoarad Thanks! Can you elaborate on what you find weird about this?

edoarad commented 1 year ago

Sure. So, ideally, x would be sampled only once each time and then squared, so we'd get only positive numbers. An alternative implementation (mistaken, imo) would sample independently from two copies of x, in which case we'd get a symmetric distribution around 0. Neither seems to be the case here.
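To make the two behaviors concrete, here's what each would look like in plain NumPy (an illustrative sketch, assuming the standard normal I had in mind):

import numpy as np

rng = np.random.default_rng(0)

# Behavior 1: sample x once per iteration, then square -> every value is non-negative
x = rng.normal(0, 1, size=1000)
squared = x * x

# Behavior 2 (mistaken, imo): multiply two independent copies -> symmetric around 0
a = rng.normal(0, 1, size=1000)
b = rng.normal(0, 1, size=1000)
product = a * b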

peterhurford commented 1 year ago

> ideally, x would be sampled only once each time and then squared

That doesn't work in Python, because x is a distribution object: each distribution object gets sampled independently, and I don't think there's any easy way for the sampler to know that both occurrences of x should share the same sample. However, the solution is pretty easy -- if you do sq.sample(x ** 2, n=1000), that should get you the desired behavior.
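Concretely, adapting your original snippet (the histogram call is just for inspection):

import squigglepy as sq
import numpy as np

x = sq.norm(0, 1)
a = sq.sample(x ** 2, n=1000)  # x ** 2 stays a single distribution, so each draw is squared
np.histogram(a, bins=10)       # every sample should now be non-negative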

> An alternative implementation (mistaken, imo) would sample independently from two copies of x, in which case we'd get a symmetric distribution around 0.

That's not the case with squigglepy, because sq.norm(0, 1) by default defines a 90% confidence interval between 0 and 1, so x has a mean of 0.5 rather than 0. The mean of your x * x distribution is therefore well above zero (E[x²] = E[x]² + Var(x), roughly 0.34 here), which is consistent with what you sampled. If you want to define the distribution via its mean and sd, you need to do that explicitly: x = sq.norm(mean=0, sd=1).
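For example, comparing the two parameterizations (a small sketch):

import squigglepy as sq

x_ci = sq.norm(0, 1)          # 90% CI from 0 to 1, so the mean is 0.5
x_sd = sq.norm(mean=0, sd=1)  # a standard normal, mean 0

sq.sample(x_ci ** 2, n=1000)  # samples centered well above zero
sq.sample(x_sd ** 2, n=1000)  # samples concentrated near zero, all values >= 0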

So I don't see any bad behavior here -- just a mismatch between what squigglepy does and what you expected or wanted it to do.

edoarad commented 1 year ago

Yep, my mistake on the normal distribution parameters.

I've recently talked with Ozzie about this, and I'm now pretty sure that the correct behavior is to sample only once for both occurrences of x. It could resemble your work on correlated variables, I guess, but there are also simpler implementations I can think of for the "correlation 1" case.

peterhurford commented 1 year ago

Thanks! I don't see a clear way to make the interface work the way you suggest, but we will look into this for our correlated variables interface.

In the meantime, another thing you could do is define the sample via a function:

def model():
    x = ~sq.norm(0, 1)  # use `~` to sample x explicitly, up front
    return x * x        # both uses of x now share the same sampled value

sq.sample(model, n=1000)  # the model re-runs for each sample, so x still varies across draws
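And a quick sanity check (assuming numpy, as in your original snippet):

import numpy as np

a = sq.sample(model, n=1000)
(np.asarray(a) >= 0).all()  # True -- every sample is non-negative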
edoarad commented 1 year ago

I don't think this is working.

[screenshot of an attempt using x = sq.norm(-1, 1) without the ~]

peterhurford commented 1 year ago

@edoarad You have to explicitly sample the x first: x = ~sq.norm(-1, 1), not x = sq.norm(-1, 1). Otherwise the distribution objects get passed through and you're back to the original problem.
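Side by side (using the -1 to 1 interval from your screenshot):

def broken_model():
    x = sq.norm(-1, 1)   # x is still a distribution object here
    return x * x         # so you get the weird behavior from the original report

def working_model():
    x = ~sq.norm(-1, 1)  # `~` draws a concrete value first
    return x * x         # so this squares that single draw; the result is >= 0

sq.sample(working_model, n=1000)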

edoarad commented 1 year ago

Seems like I just assumed something silly about ~ and didn't read the comment or try your code 🤦‍♂️ It works well, just as you say.

edoarad commented 1 year ago

Okay, in samplers.py you have _squigglepy_internal_sample_caches. I'm not sure exactly what it's intended for, but you could use it to implement this feature.

It uses str(dist) as keys, but if you use id(dist) instead, it should work well. Then you only need to refresh the cache on each iteration.
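To sketch what I mean (hypothetical code, not squigglepy's actual internals; the names here are made up):

import numpy as np

_sample_cache = {}  # hypothetical stand-in for _squigglepy_internal_sample_caches

class Norm:  # toy distribution object standing in for sq.norm
    def __init__(self, mean, sd):
        self.mean, self.sd = mean, sd

def cached_sample(dist):
    key = id(dist)  # object identity: repeated uses of the same object share one draw,
                    # while two equal-but-distinct objects still sample independently
    if key not in _sample_cache:
        _sample_cache[key] = np.random.normal(dist.mean, dist.sd)
    return _sample_cache[key]

def run_model(model, n):
    out = []
    for _ in range(n):
        _sample_cache.clear()  # refresh the cache each iteration
        out.append(model())
    return out

x = Norm(0, 1)
samples = run_model(lambda: cached_sample(x) * cached_sample(x), n=1000)
# every entry of samples is >= 0, since both factors reuse the same cached draw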

edoarad commented 1 year ago

(I thought about adding a dict argument to sample instead, but there's no need to have it duplicated down the call stack. I expect it wouldn't hurt performance much either way, though.)

(Also, I haven't thought at all about how this interacts with multiprocessing.)