vladr opened this issue 1 year ago
Thanks for the detailed report, @vladr!

I'll check this out 🤔 My expectation is the same as yours: performance should be in the same range as the previous version, or at the very least it shouldn't break/time out the whole dashboard.

As for the workaround, I need to review the math involved, it's been a while 😅 but if I'm not mistaken, lowering the 10K default for `beta_probability_simulations` should impact the confidence over the winning alternative.
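For anyone who wants to try that, a rough sketch of the setting in an initializer might look like the following; the 1_000 value is only an illustrative placeholder, not a vetted recommendation:

```ruby
# config/initializers/split.rb
Split.configure do |config|
  # Fewer Monte Carlo simulations = faster dashboard render, at the cost of a
  # noisier "Probability of being Winner" estimate. 1_000 is illustrative only.
  config.beta_probability_simulations = 1_000
end
```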
If I read the documentation correctly, altering `beta_probability_simulations` should only affect the "Probability of being Winner" calculation on the dashboard, but not the "Confidence" (default) calculation. Is this correct?
In addition to the dashboard, `beta_distribution_rng` is, I believe, also used by the Whiplash (multi-armed bandit) algorithm; we aren't using Whiplash at the moment, but could there also be an impact there? (Potentially in excess of 1 millisecond per trial for large participant counts, based on the benchmark above.)
Yes, that's correct. Thanks for digging further into this. I haven't been able to find some quality time for it yet. :(

The default confidence calculation is done via ZScore and should be fast.

The only impact of altering `beta_probability_simulations` is the % for "Probability of being Winner" shown on the dashboard. We try to cache that probability in Redis, as it can be slow to calculate it every time the dashboard opens... still, I think this should be improved.
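For context, the general shape of that estimate is a Monte Carlo comparison of Beta-distributed samples, which is why the cost scales with `beta_probability_simulations`. Below is a minimal, generic Ruby sketch of the idea, not Split's actual code; the counts at the bottom are made up:

```ruby
# Sketch of the Monte Carlo idea behind "Probability of being Winner":
# each alternative's conversion rate is drawn from
# Beta(completed + 1, participants - completed + 1), and the alternative with
# the highest draw "wins" that round. Repeating `simulations` times and
# counting wins is why the cost grows linearly with the simulation count.

def standard_normal
  # Box-Muller transform; 1.0 - rand keeps the log argument in (0, 1].
  Math.sqrt(-2.0 * Math.log(1.0 - rand)) * Math.cos(2.0 * Math::PI * rand)
end

def gamma_sample(shape)
  # Marsaglia-Tsang method, valid for shape >= 1 (always true here,
  # since shape = count + 1).
  d = shape - 1.0 / 3.0
  c = 1.0 / Math.sqrt(9.0 * d)
  loop do
    x = standard_normal
    v = (1.0 + c * x)**3
    next if v <= 0
    u = 1.0 - rand
    return d * v if Math.log(u) < 0.5 * x**2 + d * (1.0 - v + Math.log(v))
  end
end

def beta_sample(a, b)
  x = gamma_sample(a)
  y = gamma_sample(b)
  x / (x + y)
end

def probability_of_being_winner(alternatives, simulations: 10_000)
  wins = Hash.new(0)
  simulations.times do
    draws = alternatives.transform_values do |stats|
      beta_sample(stats[:completed] + 1, stats[:participants] - stats[:completed] + 1)
    end
    wins[draws.max_by { |_, rate| rate }.first] += 1
  end
  wins.transform_values { |count| count.to_f / simulations }
end

# Made-up counts, purely illustrative:
puts probability_of_being_winner(
  { "control" => { participants: 4_000_000, completed: 1_000_000 },
    "variant" => { participants: 4_000_000, completed: 1_010_000 } }
)
```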
Thanks for confirming, and for reminding me of the Redis cache! If anyone else stumbles over this ticket because they are running into a `Rack::Timeout::RequestTimeoutException` or equivalent (i.e. something preventing the calculation from completing) when loading the dashboard, I can confirm that manually "priming" the cache first in a Rails console and then accessing the Dashboard also works as a palliative measure:
```ruby
irb(main):001:0> Split::ExperimentCatalog.all_active_first.each { |x| print "#{Time.now} - #{x.name} ...\n" ; x.calc_winning_alternatives }
```
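If you'd rather not run that by hand, the same loop could be wrapped in a scheduled task so the dashboard only ever reads a warm cache; the task name and the idea of running it from cron/whenever are just one possible setup, not something Split ships:

```ruby
# lib/tasks/split_cache.rake (hypothetical task name)
namespace :split do
  desc "Pre-compute winning-alternative probabilities so the dashboard reads a warm cache"
  task warm_winner_cache: :environment do
    Split::ExperimentCatalog.all_active_first.each do |experiment|
      puts "#{Time.now} - #{experiment.name} ..."
      experiment.calc_winning_alternatives
    end
  end
end
```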
@andrehjr 👋🏻 I'm facing the same problem. Has there been any progress on this issue, or is there a best guess for when it might be addressed? The workaround above works, but ideally my team shouldn't need to know about this gotcha in future versions.
**Describe the bug**
When using Split 3.4.1, the Dashboard page loaded quasi-instantaneously. After upgrading to Split 4.0.2, the Dashboard can take so long to load (render) that the request times out, making the Dashboard unusable. See below for a possible root cause.
**To Reproduce**
Steps to reproduce the behavior:
Alternatively, execute what `Split::Experiment#estimate_winning_alternative` calls when invoked by the Dashboard, for an alternative with 4M participants and ~25% trial completion; in my case, just this one call takes >10s (a rough timing sketch is shown below).
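For illustration only (this is not the exact snippet, and the experiment name is hypothetical), timing the wrapping method from a Rails console could look like this:

```ruby
# Hypothetical experiment name; times the dashboard-side calculation directly.
require "benchmark"

experiment = Split::ExperimentCatalog.find("my_experiment")
puts Benchmark.measure { experiment.estimate_winning_alternative }
```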
**Expected behavior**
The Dashboard should load in roughly the same time as it did under 3.4.1. (A possible mitigation might be to lower `beta_probability_simulations`--but it's hard to ascertain what the consequence of doing that would be, since Issue #453 is still without an answer.)

**Additional context**
This is the stack trace at the time of the request timeout:
This is the `StackProf` summary for `Split::Experiment#estimate_winning_alternative` on one of the more problematic experiments: