spotify / confidence

Apache License 2.0
Example for two-sample T-test for continuous variables? #82

Closed jpzhangvincent closed 1 year ago

jpzhangvincent commented 1 year ago

It seems the example (i.e. the Z-test) in the frequentist notebook only covers binary metrics such as conversion rate. Does this package also support a T-test for continuous variables? I saw that StudentsTTest requires both numerator_column and denominator_column (from a contingency-table format?), so I'm not sure whether it's possible to run a two-sample T-test on a single continuous variable column with this API. Any example or documentation would be appreciated!

iampelle commented 1 year ago

See the example starting in cell 12 of the frequentist notebook. The df has a nr_of_items column passed to numerator_column and a nr_of_items_sumsq passed to the numerator_sum_squares_column of the ZTest. These two columns are used together with the denominator_column to compute the variance of the continuous metric.

If you prefer you can use the StudentsTTest class instead of the ZTest class.
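
To illustrate how a two-sample test can be formed from nothing but the three columns described above, here is a minimal sketch of the textbook z-test for a difference in means. It is not the library's internal code: the function names are hypothetical, the input numbers are made up, and the variance uses the plain divide-by-n form, which the library may or may not correct.

```python
import math
from statistics import NormalDist


def mean_and_variance(total, total_sq, n):
    """Recover mean and (population) variance from sum, sum of squares and count."""
    mean = total / n
    variance = total_sq / n - mean ** 2
    return mean, variance


def z_test_from_summaries(control, treatment):
    """Two-sided z-test for a difference in means.

    Each argument is a (sum, sum_of_squares, count) tuple, mirroring the
    numerator, numerator-sum-of-squares and denominator columns.
    """
    mean_c, var_c = mean_and_variance(*control)
    mean_t, var_t = mean_and_variance(*treatment)
    n_c, n_t = control[2], treatment[2]

    difference = mean_t - mean_c
    standard_error = math.sqrt(var_c / n_c + var_t / n_t)
    z = difference / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return difference, z, p_value


# Hypothetical summary rows: (nr_of_items, nr_of_items_sumsq, users)
control = (1000, 3000, 500)
treatment = (1100, 3500, 500)

difference, z, p_value = z_test_from_summaries(control, treatment)
print(difference, z, p_value)
```

A t-test would use the same statistic but compare it against a Student's t distribution rather than the normal distribution; with large samples the two give nearly identical results.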

jpzhangvincent commented 1 year ago

I'm still a bit confused about the setup of the data frame. Just to confirm: it doesn't look like each row represents a single sample. Does nr_of_items mean the average of the continuous variable of interest within a variant group, does nr_of_items_sumsq represent sum((x_i - x_mean)^2), and does users mean the sample size of the variant group? And does the API expect the user to pre-calculate those statistics and construct the data frame that way? I'm wondering whether it would be better to have a simpler interface like scipy.stats.ttest_*, where you simply pass in two lists of observations.

iampelle commented 1 year ago

At Spotify the sample size is often in the hundreds of millions, and then it's not very convenient to pass in every single observation, so we prefer using summary statistics.
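
One reason summary statistics scale well is that sums, sums of squares and counts are additive: partial results computed on separate chunks of data can simply be added together. A small sketch of that idea, using hypothetical helper names and toy data:

```python
def summarise(observations):
    """Return (sum, sum of squares, count) for one chunk of raw observations."""
    return (
        sum(observations),
        sum(x * x for x in observations),
        len(observations),
    )


def merge(a, b):
    """Combine two partial summaries by element-wise addition."""
    return tuple(x + y for x, y in zip(a, b))


# Toy data split across two chunks, e.g. two partitions of a large table.
chunk_1 = summarise([3, 2])
chunk_2 = summarise([4, 0, 1])

merged = merge(chunk_1, chunk_2)
all_at_once = summarise([3, 2, 4, 0, 1])
print(merged, all_at_once)
```

The merged summary is identical to the one computed over all observations at once, so the raw rows never need to be collected in one place.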

To make it more concrete, let's imagine that nr_of_items is the number of playlists a Spotify user has created. Let's say we have five users in the control group who created 3, 2, 4, 0 and 1 playlists respectively. Then nr_of_items would be the sum, 3+2+4+0+1=10, nr_of_items_sumsq would be 3^2+2^2+4^2+0^2+1^2=30, and users would be 5. The same goes for the treatment group. Internally we can use these summary statistics to compute the mean as nr_of_items/users and the variance as nr_of_items_sumsq/users - (nr_of_items/users)^2, and then we can use those to compute test statistics and confidence intervals.
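
The playlist example can be reproduced in a few lines of plain Python. This is only a sketch with hypothetical helper names; it uses the divide-by-n variance from the formula above, and the library may apply a small-sample correction on top of it.

```python
def summarise(observations):
    """Collapse raw per-user values into the three summary columns."""
    return {
        "nr_of_items": sum(observations),
        "nr_of_items_sumsq": sum(x * x for x in observations),
        "users": len(observations),
    }


def mean_and_variance(row):
    """Recover mean and variance from one row of summary statistics."""
    mean = row["nr_of_items"] / row["users"]
    variance = row["nr_of_items_sumsq"] / row["users"] - mean ** 2
    return mean, variance


# The five control-group users from the example above.
control = summarise([3, 2, 4, 0, 1])
mean, variance = mean_and_variance(control)

print(control)         # {'nr_of_items': 10, 'nr_of_items_sumsq': 30, 'users': 5}
print(mean, variance)  # 2.0 2.0
```

So each row of the data frame describes one variant group, not one user.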

Does that make sense?

jpzhangvincent commented 1 year ago

Ah, that makes sense for ease of computation and scalability. It took me a while to wrap my head around it, but I'm glad I understand the motivation better now. It would be great to have some documentation on this in the notebook example and the API. Thanks!