This looks like a great library for people who want to run A/B tests themselves; I haven't seen any other package that takes care of the whole stack and makes it this easy.
However, I'm wary of the statistics here, for a few reasons:
The package appears to give one-tailed p-values. This is debatable, but I believe that doesn't line up with how most people interpret their A/B tests and thus leads to an excess of false positives.
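Concretely, a one-tailed p-value is half the two-tailed one for a symmetric statistic like Z, so results look roughly twice as confident as under the two-tailed convention most people expect. A minimal sketch with scipy (nothing here comes from the package):

```python
# One- vs. two-tailed p-values for a symmetric test statistic:
# the one-tailed value is half the two-tailed one, so one-tailed
# reporting inflates the apparent confidence.
from scipy.stats import norm

z = 1.645                      # a z-value often read as "95% confident"
p_one_tailed = norm.sf(z)      # upper-tail probability, ~0.05
p_two_tailed = 2 * norm.sf(z)  # ~0.10, i.e. only ~90% confidence
print(p_one_tailed, p_two_tailed)
```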
The package supports comparing multiple variations against the baseline but applies no multiple-testing correction, and gives no warning about it. Granted, very few A/B testing packages address this, but it can also lead to a (potentially serious) excess of false positives.
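The simplest fix is a Bonferroni correction; here's a sketch with made-up p-values (nothing below comes from the package):

```python
# Bonferroni correction: with m comparisons against the baseline, test
# each at alpha / m to keep the overall false-positive rate near alpha.
# These p-values are hypothetical, purely for illustration.
p_values = [0.03, 0.04, 0.20]   # one test per variation vs. the baseline
alpha = 0.05
m = len(p_values)

significant = [p < alpha / m for p in p_values]
print(significant)  # [False, False, False] -- nothing survives correction
```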
The package computes its own table of four normal tail probabilities using a fairly crude numerical method.
For the largest z-value in that table, the computed tail probability is low by a factor of 3, which can matter a lot for people trying to run high-confidence tests: once again, confidence is overestimated, which leads to excess false positives. It'd be simpler and more accurate to hardcode a lookup table, since tables of these values are readily available. (One could also use a statistical library, or directly code a numerical approximation to the standard normal CDF, as sketched below.)
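On that last option: the standard Abramowitz & Stegun polynomial approximation (formula 26.2.17) is only a few lines and has absolute error below 7.5e-8. This is a sketch, not the package's existing code:

```python
import math

def normal_tail(z):
    """Upper-tail probability P(Z > z) for the standard normal.

    Abramowitz & Stegun 26.2.17; absolute error < 7.5e-8, far better
    than a coarse hand-computed table.
    """
    t = 1.0 / (1.0 + 0.2316419 * abs(z))
    poly = t * (0.319381530 + t * (-0.356563782 + t * (1.781477937
           + t * (-1.821255978 + t * 1.330274429))))
    tail = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi) * poly
    return tail if z >= 0 else 1.0 - tail

# Spot-checks against widely published table values:
print(normal_tail(1.96))  # ~0.0250
print(normal_tail(3.0))   # ~0.00135
```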
The package uses the non-pooled Z-test for two proportions in a case where I believe the pooled test is more appropriate, because the null hypothesis is that the control and the variation have the same conversion rate. The non-pooled test tends to overestimate Z-values and thus, again, confidence.
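For reference, the textbook pooled statistic looks like this (a sketch, not the package's current code):

```python
import math

def pooled_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion Z-statistic.

    Under the null hypothesis both groups share a single conversion
    rate, so the standard error is built from the pooled estimate
    rather than from each group's own rate.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pooled * (1.0 - p_pooled) * (1.0 / n_a + 1.0 / n_b))
    return (p_b - p_a) / se

# The 6/500 vs. 20/500 example discussed below:
print(pooled_z(6, 500, 20, 500))  # ~2.78
```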
The Z-test can be inaccurate for small samples, though this doesn't actually contribute much error here: the package requires at least 25 total conversions, which keeps us in pretty safe territory. With highly uneven sampling weights we could still get into small-sample trouble, but it's definitely less of an issue.
As an example of the potential for numerical accuracy issues, consider an experiment with 6/500 conversions in baseline and 20/500 conversions in one variation. The package reports 99.9% confidence, or a one-tailed p-value <= 0.001. Fisher's Exact Test gives a one-tailed p-value of 0.0042. The original ABBA gives a two-tailed p-value of 0.0063, corresponding to a one-tailed p-value of roughly 0.0033. So we're underestimating the one-tailed p-value by a factor of 3-4 and the two-tailed p-value (which is probably more appropriate) by a factor of 6-8. This is pretty substantial -- our long-run false-positive rate will be 6-8x higher than we expect (ignoring multiple testing issues).
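Anyone who wants to reproduce the Fisher figure can do it in a couple of lines with scipy (again, scipy is independent of the package under review):

```python
from scipy.stats import fisher_exact

# 2x2 contingency table: rows are groups, columns are converted / not.
table = [[6, 494],    # baseline: 6/500 conversions
         [20, 480]]   # variation: 20/500 conversions

# One-tailed test that the baseline's conversion rate is the lower one.
odds_ratio, p_one_tailed = fisher_exact(table, alternative='less')
print(p_one_tailed)  # ~0.0042, vs. the package's reported p <= 0.001
```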
I think this would be a really awesome contribution to the world of A/B testing with some more robust statistics behind it. I'd suggest using the (original) ABBA JS library (or perhaps a port of the Python version to Ruby), which also gets you some nice confidence intervals on proportions and improvements. Together the two would make a pretty sweet solution for do-it-yourself A/B testing.
Full disclosure: I authored the original ABBA library.