splitrb / split

📈 The Rack Based A/B testing framework
https://rubygems.org/gems/split
MIT License

Split Dashboard can take minutes to load after upgrading to Split v4.x (with RubyStats) #713

Open vladr opened 1 year ago

vladr commented 1 year ago

Describe the bug
When using Split 3.4.1, the Dashboard page loaded quasi-instantaneously. After upgrading to Split 4.0.2, the Dashboard can take so long to load (render) that the request times out, making the Dashboard unusable. See below for a possible root cause.

To Reproduce
Steps to reproduce the behavior:

  1. Ensure a number of experiments exist, where each alternative has a few million participants and ~25% completion rate.
  2. Go to the Split Dashboard
  3. Observe the time that it takes the page to load

Alternatively, execute the following, which is equivalent to what Split::Experiment#estimate_winning_alternative does when invoked by the Dashboard for a single alternative with 4M participants and ~25% trial completion; in my case, this alone takes >10s!

> Benchmark.measure { 10000.times { Split::Algorithms::beta_distribution_rng(1_000_000, 4_000_000) } }
=> #<Benchmark::Tms:0x0000555f7b388508 @label="", @real=10.572727171995211, @cstime=0.0, @cutime=0.0, @stime=0.006018999999999997, @utime=10.566891999999996, @total=10.572910999999996>
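For reference, the slowness can be reproduced with RubyStats alone, without Split. Judging by the stack trace below, Split::Algorithms.beta_distribution_rng delegates to Rubystats::BetaDistribution#rng, which inverts the CDF numerically (find_root -> incomplete_beta -> beta_fraction), and that inversion is what costs roughly a millisecond per draw at these parameter sizes. A minimal sketch (timings are machine-dependent):

  require "benchmark"
  require "rubystats"

  dist = Rubystats::BetaDistribution.new(1_000_000, 4_000_000)
  # Each #rng call performs a numeric ICDF inversion; expect on the order of 1ms per draw.
  puts Benchmark.measure { 1_000.times { dist.rng } }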

Expected behavior
The Dashboard should load in roughly the same time as it did under Split 3.4.1, rather than timing out.

Additional context
This is the stack trace at the time of the request timeout:

Rack::Timeout::RequestTimeoutException - Request ran for longer than 28000ms :
  ruby/2.7.0/gems/rubystats-0.4.1/lib/rubystats/modules.rb:491:in `*'
  ruby/2.7.0/gems/rubystats-0.4.1/lib/rubystats/modules.rb:491:in `beta_fraction'
  ruby/2.7.0/gems/rubystats-0.4.1/lib/rubystats/modules.rb:464:in `incomplete_beta'
  ruby/2.7.0/gems/rubystats-0.4.1/lib/rubystats/beta_distribution.rb:61:in `cdf'
  ruby/2.7.0/gems/rubystats-0.4.1/lib/rubystats/probability_distribution.rb:148:in `find_root'
  ruby/2.7.0/gems/rubystats-0.4.1/lib/rubystats/beta_distribution.rb:84:in `icdf'
  ruby/2.7.0/gems/rubystats-0.4.1/lib/rubystats/beta_distribution.rb:89:in `rng'
  ruby/2.7.0/gems/split-4.0.2/lib/split/algorithms.rb:18:in `beta_distribution_rng'
  ruby/2.7.0/gems/split-4.0.2/lib/split/experiment.rb:364:in `block in calc_simulated_conversion_rates'
  ruby/2.7.0/gems/split-4.0.2/lib/split/experiment.rb:361:in `each'
  ruby/2.7.0/gems/split-4.0.2/lib/split/experiment.rb:361:in `calc_simulated_conversion_rates'
  ruby/2.7.0/gems/split-4.0.2/lib/split/experiment.rb:301:in `block in estimate_winning_alternative'
  ruby/2.7.0/gems/split-4.0.2/lib/split/experiment.rb:299:in `times'
  ruby/2.7.0/gems/split-4.0.2/lib/split/experiment.rb:299:in `estimate_winning_alternative'
  ruby/2.7.0/gems/split-4.0.2/lib/split/experiment.rb:280:in `calc_winning_alternatives'

This is the StackProf summary for Split::Experiment#estimate_winning_alternative on one of the more problematic experiments:

==================================
  Mode: cpu(1000)
  Samples: 7719 (0.00% miss rate)
  GC: 44 (0.57%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
      7309  (94.7%)        7309  (94.7%)     Rubystats::SpecialMath#beta_fraction
      7466  (96.7%)          71   (0.9%)     Rubystats::SpecialMath#incomplete_beta
        86   (1.1%)          64   (0.8%)     Rubystats::BetaDistribution#pdf
      7619  (98.7%)          45   (0.6%)     Rubystats::ProbabilityDistribution#find_root
        61   (0.8%)          43   (0.6%)     Rubystats::SpecialMath#log_gamma
       102   (1.3%)          41   (0.5%)     Rubystats::SpecialMath#log_beta
        28   (0.4%)          28   (0.4%)     (marking)
        21   (0.3%)          21   (0.3%)     Rubystats::BetaDistribution#initialize
        18   (0.2%)          18   (0.2%)     Rubystats::ProbabilityDistribution#check_range
        17   (0.2%)          17   (0.2%)     ActiveSupport::EachTimeWithZone#ensure_iteration_allowed
        16   (0.2%)          16   (0.2%)     (sweeping)
        ...
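As a back-of-envelope check (if I follow the stack trace correctly, estimate_winning_alternative draws one beta sample per alternative per simulation, and beta_probability_simulations defaults to 10,000), a single experiment is already enough to blow the 28s Rack::Timeout budget; the 3-alternative experiment below is just an illustrative value:

  per_sample_s = 10.57 / 10_000  # ~1.06 ms per beta draw, from the Benchmark.measure above
  simulations  = 10_000          # Split's default beta_probability_simulations
  alternatives = 3               # hypothetical experiment with 3 alternatives
  puts per_sample_s * simulations * alternatives
  # => ~31.7 seconds of CPU for one experiment, already past the 28s request timeout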
andrehjr commented 1 year ago

Thanks for the detailed report @vladr!

I'll check this out 🤔 My expectation is the same as yours. The performance should've been in the same range as the previous version. Or at least, it shouldn't break/time out the whole dashboard.

As for a workaround, I also need to review the math involved, it's been a while 😅 but if I'm not mistaken, lowering beta_probability_simulations from its 10K default should impact the confidence in the winning alternative.
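If someone wants to experiment with lowering it, it should be tunable from the usual config block (untested here, and 1_000 is just an illustrative value):

  Split.configure do |config|
    # Defaults to 10_000; lower values trade precision of the simulated estimate for speed.
    config.beta_probability_simulations = 1_000
  end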

vladr commented 1 year ago

If I read the documentation correctly, altering beta_probability_simulations should only affect the "Probability of being Winner" calculation on the dashboard, but not the "Confidence" (default) calculation. Is this correct?

In addition to the dashboard, beta_distribution_rng is, I believe, also used by the Whiplash (multi-armed bandit) algorithm; we aren't using Whiplash at the moment, but could there also be an impact there? (potentially in excess of 1 millisecond per trial for large participant counts, based on the benchmark above).

andrehjr commented 1 year ago

Yes, that's correct. Thanks for digging deeper into this. I haven't been able to find quality time for it yet. :(

The default confidence calculation is done via ZScore and should be fast.

The only impact of altering beta_probability_simulations is the percentage shown for "Probability of being Winner" on the dashboard. We try to cache that probability in Redis, as it can be slow to recalculate every time the dashboard opens... still, I think this should be improved.

vladr commented 1 year ago

Thanks for confirming, and for reminding me of the Redis cache! If anyone else stumbles over this ticket because of running into a Rack::Timeout::RequestTimeoutException or equivalent (i.e. the calculation being prevented from completing) when loading the Dashboard, I can confirm that manually "priming" the cache first in the Rails console and then accessing the Dashboard also works as a palliative measure:

irb(main):001:0> Split::ExperimentCatalog.all_active_first.each { |x| print "#{Time.now} - #{x.name} ...\n" ; x.calc_winning_alternatives }
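For anyone who would rather keep the cache warm automatically instead of priming it by hand, the same loop could be wrapped in a scheduled rake task; this is just a sketch, and the task name is made up:

  namespace :split do
    desc "Pre-calculate winning alternatives so the Dashboard reads from the Redis cache"
    task prime_winning_alternatives: :environment do
      Split::ExperimentCatalog.all_active_first.each do |experiment|
        puts "#{Time.now} - #{experiment.name} ..."
        experiment.calc_winning_alternatives
      end
    end
  end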
maciesielka commented 11 months ago

@andrehjr 👋🏻 I'm facing the same problem. Is there any progress on this issue, or a best guess about when it might be addressed? The workaround above works, but ideally my team wouldn't need to know about this gotcha in future versions.