quantified-uncertainty / squiggle

An estimation language
https://squiggle-language.com
MIT License
157 stars 23 forks source link

mixture containing one point doesn't correctly sample for it, nor generate adequate statistics #1589

Open NunoSempere opened 1 year ago

NunoSempere commented 1 year ago

I.

See the summary statistics for:


mixture(0)

II.

See the samples for:

sampleN(mixture(0), 10)
berekuk commented 1 year ago

Still doesn't work correctly even with #1591, unfortunately.

berekuk commented 1 year ago

I'd like to flag this as pretty important.

I previously thought that it's not a big deal, because who cares if it's exactly zero or 1e-16.

But then I tried to use mx to simulate multiple discrete branches, and got confused why my code wasn't working properly.

Captura de pantalla 2023-10-06 a la(s) 14 56 51

https://www.squiggle-language.com/playground?v=0.8.5#code=eNptjj0LwjAURf%2FKJVMLfpCMQl106dzROKT2FQNJWpqIgva%2Fm5qKCi6Bc3Pv4d1ZQ626mLDrGmIbdlKeyoajAJcugYggpHujj2hv2WFuLjC3jvl3Z7lFpWxvqKKwaofO7rUPMZXuE1vVZ9IB9%2BkBHmn7SKTbWYyiwPuqcCYHjvUa2mEgZXCKZ8dYe1y7i2lQE2yMDM%2Fy5CHj6Z9MJJmYZK%2BF%2BFlIVtq%2B817XhiSbPkbpcjY%2BARYzYbM%3D

OAGr commented 1 year ago

Yea, that makes sense. I'd imagine that this specific issue isn't dramatically difficult? Maybe 3-20 hours, could be worth it soon.

NunoSempere commented 1 year ago

Incidentally, the approach that squiggle.c takes here is, I think, more robust, so that these things don't happen, because it defines variables as samplers, which are conceptually clearer.

In fake pseudo-code, it is doing something like:

let a = () => 0
let b = () => sample_from_normal(1, 10)
let mixture = (samplers, weights) => {
   let normalized_cumsummed_weights = cumsum(weights) / sum(weights)
   let p = Math.random()
   let i = findFirstIndex(normalized_cumsummed_weights, x => p < x) || samplers.length
   return samplers[i]()
}
let resutlSampler = mixture([a,b], [1, 9])
let result = Array.from({length: 10000}, _ => resultSampler())
console.log(result)

You get more verbosity, but the type of objects which a and b are is a bit clearer, and there is no difference between a sampler which samples from discrete vs continuous distribution.

I don't know to what extent you will find this useful, but I thought I'd share.

OAGr commented 1 year ago

Yea, I think that approach can be good in certain use cases. You can do similar things in Squiggle if you want, though it's not as emphasized.

berekuk commented 1 year ago

Representing all distributions as generator functions that lazily return samples is a very tempting idea to me.

Benefits:

Risks/problems:

One more meta-level benefit is that if we do this, it'd be much clearer to me that Squiggle is necessary, and not just a trivial layer of syntax sugar on top of JS.


One less extreme way to try this, without changing the entire language and its semantics, is to add a 4th type of distributions: "GeneratorDist". I think it'd be mostly compatible with what we have right now: we could treat it as a lazy SampleSetDist at first, and pull sampleCount samples from it when it's first used. Then we could make it more lazy when it's possible.

OAGr commented 1 year ago

Good point about imports. I think that generator functions are neat, but more complicated for users - so I was kind of hoping we wouldn't need them, but maybe we would for imports to scale.

I'm pretty nervous about doing another huge change now.

is to add a 4th type of distributions: "GeneratorDist". I think it'd be mostly compatible with what we have right now: we could treat it as a lazy SampleSetDist at first, and pull sampleCount samples from it when it's first used.

I would be interested/fine with doing something like this.

berekuk commented 1 year ago

I think that generator functions are neat, but more complicated for users

Do you have examples where that might be the case?

My hope for this idea is that we could make a switch completely transparently from users, mostly. They don't need to know that their x = 2 to 5 doesn't generate a list of samples immediately.