rethinkpriorities / squigglepy

Squiggle programming language for intuitive probabilistic estimation features in Python
MIT License

Vectorized sample #63

Open erwald opened 6 months ago

erwald commented 6 months ago

It would be nice to have a vectorized version of sq.sample. I often have data frames or series that contain distributions, and when I want to expand those to medians and percentiles, I have to use apply, which is slow and a bit verbose. It would be nicer to sample the entire series at once. (This is a feature request.)

peterhurford commented 6 months ago

@erwald can you give me a code sample?

Basically I want to see two things to understand:

1.) Write the code that uses apply the way that actually works

2.) Write pretend code that uses sq.sample the way you ideally want it to work should this feature be implemented correctly

erwald commented 6 months ago

Here's some code:

import pandas as pd
import squigglepy as sq

N = 1000
series = pd.Series(range(1, 5)) * sq.norm(mean=0, sd=1)
print(series.apply(lambda row: sq.get_percentiles(row @ N, percentiles=[50]))) # works
print(sq.get_percentiles(series @ N, percentiles=[50])) # would be nice if it did work

The first print statement will output a series of medians:

0    0.046183
1   -0.003956
2   -0.016223
3   -0.025846

The second print statement does not work because you currently can't sample a series or data frame. I mostly want this as a convenience, but it might also improve performance, since I believe sampling a whole series at once could use vectorized operations (multiplication, etc.) instead of a row-by-row apply?

(Of course the above example would also require get_percentiles to be vectorized in this way.)
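To illustrate where the performance benefit might come from, here is a minimal sketch that uses only NumPy and pandas, not squigglepy. It assumes each row is a standard normal scaled by 1 to 4, as in the example above, and draws every row's samples in one call before taking the medians. The names `scales`, `samples`, and `medians` are illustrative, and nothing here reflects how squigglepy samples internally.

```python
import numpy as np
import pandas as pd

N = 1000
rng = np.random.default_rng(seed=0)  # seeded only to make the sketch reproducible

# Row multipliers 1..4, matching pd.Series(range(1, 5)) in the example above.
scales = pd.Series(range(1, 5))

# Draw all rows at once: one (4, N) matrix of standard-normal samples,
# scaled row by row through broadcasting instead of a Python-level loop.
samples = rng.standard_normal(size=(len(scales), N)) * scales.to_numpy()[:, None]

# One median per row, computed in a single vectorized call.
medians = pd.Series(np.median(samples, axis=1), index=scales.index)
print(medians)
```

This only covers the case where every row belongs to the same distribution family; a series that mixes arbitrary distributions would presumably still need some per-row work.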

erwald commented 6 months ago

A common use case here is that I have some time series of estimates represented as squigglepy distributions, and I want to get the medians and 5th and 95th percentiles (or similar) for each row for plotting.
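For the plotting use case, a rough sketch of the summary step might look like the following. It assumes the samples for each row are already available as a 2-D array of shape (rows, samples); how that array gets produced from squigglepy distributions is left open. `percentile_table` is a hypothetical helper, not part of squigglepy, and the dates and normal samples are placeholders.

```python
import numpy as np
import pandas as pd


def percentile_table(samples, index, percentiles=(5, 50, 95)):
    """Summarize a (rows, n_samples) array as one row of percentiles per input row."""
    samples = np.asarray(samples)
    # np.percentile with axis=1 returns shape (len(percentiles), rows); transpose it
    # so each input row becomes one output row.
    values = np.percentile(samples, percentiles, axis=1).T
    columns = [f"p{p}" for p in percentiles]
    return pd.DataFrame(values, index=index, columns=columns)


# Placeholder time series: four days whose estimates drift upward.
dates = pd.date_range("2023-01-01", periods=4, freq="D")
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=np.arange(4)[:, None], scale=1.0, size=(4, 1000))

table = percentile_table(samples, dates)
print(table)
```

The resulting frame has one column per percentile, so the median can be drawn as a line and the 5th and 95th percentiles as a band around it.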