mixture containing one point doesn't correctly sample for it, nor generate adequate statistics

NunoSempere commented 1 year ago

I.

See the summary statistics for:


mixture(0)

II.

See the samples for:

sampleN(mixture(0), 10)

berekuk commented 1 year ago

Still doesn't work correctly even with #1591, unfortunately.

berekuk commented 1 year ago

I'd like to flag this as pretty important.

I previously thought that it's not a big deal, because who cares if it's exactly zero or 1e-16.

But then I tried to use mx to simulate multiple discrete branches, and got confused why my code wasn't working properly.

Captura de pantalla 2023-10-06 a la(s) 14 56 51

https://www.squiggle-language.com/playground?v=0.8.5#code=eNptjj0LwjAURf%2FKJVMLfpCMQl106dzROKT2FQNJWpqIgva%2Fm5qKCi6Bc3Pv4d1ZQ626mLDrGmIbdlKeyoajAJcugYggpHujj2hv2WFuLjC3jvl3Z7lFpWxvqKKwaofO7rUPMZXuE1vVZ9IB9%2BkBHmn7SKTbWYyiwPuqcCYHjvUa2mEgZXCKZ8dYe1y7i2lQE2yMDM%2Fy5CHj6Z9MJJmYZK%2BF%2BFlIVtq%2B817XhiSbPkbpcjY%2BARYzYbM%3D

OAGr commented 1 year ago

Yea, that makes sense. I'd imagine that this specific issue isn't dramatically difficult? Maybe 3-20 hours, could be worth it soon.

NunoSempere commented 1 year ago

Incidentally, the approach that squiggle.c takes here is, I think, more robust, so that these things don't happen, because it defines variables as samplers, which are conceptually clearer.

In fake pseudo-code, it is doing something like:

let a = () => 0
let b = () => sample_from_normal(1, 10)
let mixture = (samplers, weights) => {
   let normalized_cumsummed_weights = cumsum(weights) / sum(weights)
   let p = Math.random()
   let i = findFirstIndex(normalized_cumsummed_weights, x => p < x) || samplers.length
   return samplers[i]()
}
let resutlSampler = mixture([a,b], [1, 9])
let result = Array.from({length: 10000}, _ => resultSampler())
console.log(result)

You get more verbosity, but the type of objects which a and b are is a bit clearer, and there is no difference between a sampler which samples from discrete vs continuous distribution.

I don't know to what extent you will find this useful, but I thought I'd share.

OAGr commented 1 year ago

Yea, I think that approach can be good in certain use cases. You can do similar things in Squiggle if you want, though it's not as emphasized.

berekuk commented 1 year ago

Representing all distributions as generator functions that lazily return samples is a very tempting idea to me.

Benefits:

no need to decide on sample count in advance (important especially in the dependency chain when we add imports, and also in mixtures as discussed above)
I think it might unlock the capabilities similar to WebPPL and similar probabilistic languages, i.e. Bayesian inference (we'd have to implement some MCMC algorithm for that, of course; but with our current SampleSet implementation, non-trivial Bayesian inference is not possible at all)
no performance costs when the model is executed, which is important for imports

Risks/problems:

it's unclear if cdf, pdf, mean and other statistics functions would be compatible with that approach
also I'm not sure about the long-term performance implications; pulling one one at a time from a deeply nested chain of generator functions is slower because of non-locality, but we might pull N samples and then it's fine? anyway, not a problem in the short-term, we have many other performance bottlenecks
we'd have to be careful with correlation semantics, but I think that's doable and might even be more natural than with SampleSets (I'd have to think more about it to verify)

One more meta-level benefit is that if we do this, it'd be much clearer to me that Squiggle is necessary, and not just a trivial layer of syntax sugar on top of JS.

One less extreme way to try this, without changing the entire language and its semantics, is to add a 4th type of distributions: "GeneratorDist". I think it'd be mostly compatible with what we have right now: we could treat it as a lazy SampleSetDist at first, and pull sampleCount samples from it when it's first used. Then we could make it more lazy when it's possible.

OAGr commented 1 year ago

Good point about imports. I think that generator functions are neat, but more complicated for users - so I was kind of hoping we wouldn't need them, but maybe we would for imports to scale.

I'm pretty nervous about doing another huge change now.

is to add a 4th type of distributions: "GeneratorDist". I think it'd be mostly compatible with what we have right now: we could treat it as a lazy SampleSetDist at first, and pull sampleCount samples from it when it's first used.

I would be interested/fine with doing something like this.

berekuk commented 1 year ago

I think that generator functions are neat, but more complicated for users

Do you have examples where that might be the case?

My hope for this idea is that we could make a switch completely transparently from users, mostly. They don't need to know that their x = 2 to 5 doesn't generate a list of samples immediately.

quantified-uncertainty / squiggle

mixture containing one point doesn't correctly sample for it, nor generate adequate statistics #1589