rethinkpriorities / squigglepy

Squiggle programming language for intuitive probabilistic estimation features in Python
MIT License
65 stars 8 forks source link

feat: correlated variables #41

Closed agucova closed 1 year ago

agucova commented 1 year ago

This PR implements the capability to correlate variables.

This works with arbitrary distributions (continuous or discrete) as long as the samples generated out of them produce enough unique numbers for a reshuffling of samples to be able to change the rank correlation. The implementation works on a best-effort basis, but is generally robust, and double checks that the samples produced are correlated as expected, explicitly failing if not.

How does this work?

This implementation relies on the Iman-Conover method, which induces a rank correlation directly on the samples, by reordering samples, so the resulting joint distribution matches a correlated reference distribution (in our case, a multivariate normal distribution).

To make this seamless, whenever a user samples one of the variables in a group of correlated variables, we sample all the variables in the group immediately and then transform the samples with the IC method. The leftover samples we didn't need get stored in their corresponding variables (through the _correlated_samples attribute).

Whenever those other variables get sampled, the samples get fetched directly from _correlated_samples, instead of sampling every variable again, and then _correlated_samples gets cleared (to allow resampling in the future).

Because this method relies on reordering existing samples, its biggest flaw is that it doesn't work when there are too many repeated samples, as reordering them won't change the rank correlation. This is often the case with discrete distributions, as it's common to only have a few unique values.

To handle this case, the method checks beforehand that the samples have the right properties to induce the correlation, and otherwise raises an exception to the user. I considered adding noise to samples to circumvent this problem, but with many discrete distributions this could affect the entire logic of a model, so for now I've decided to just acknowledge the limitation.

Testing

To test this exhaustively, I implemented automated property-based testing with Hypothesis, which helped me debug and refine several of the kinks and numerical instabilities with the implementation.

Right now, we have full coverage across all distributions and their defined parameter domains (ignoring cases where there isn't enough sample diversity), which means Hypothesis went through and tried as hard as it can to produce examples of distributions that couldn't be correlated, without success.

One thing I haven't tested extensively yet are composite distributions. I can add this through Hypothesis with a recursive strategy, but this might be overkill for now.

Next steps

This method could be improved. Using copula-based methods would mean no longer having to worry about sampling simultaneously and, most importantly, no limitations on how many unique values to have. I think this could be computationally more inefficient and more complicated to implement, but we could at least test it and see how it compares.

Other things

At some point I needed to add a check for continuous distributions (which is no longer necessary), so I refactored the class tree as to have ContinuousDistribution and DiscreteDistributions[^1] be base classes (under BaseDistribution).

Also, thanks to Hypothesis, I found bugs with how some distributions validated their parameters, some of which I've fixed here (see #46 for more).

[^1]: I renamed the old DiscreteDistribution to CategoricalDistribution, which I also believe is a clearer name. The discrete function remains.

matthewromer commented 1 year ago

This looks awesome! Is there an expected timeline for when this will get merged?

agucova commented 1 year ago

This looks awesome! Is there an expected timeline for when this will get merged?

I never saw this, sorry! Must have got lost among the GH notifications.

There's no reason we cannot merge now, actually. I was waiting for further testing downstream, but this can be included in the next dev release instead.