rawls238 / react-experiments

React components for implementing UI experiments

enable higher order component for assigning experiment params #11

Closed rawls238 closed 9 years ago

rawls238 commented 9 years ago

@mattrheault @eytan @gusvargas

Playing around with a few things...curious what you guys think of this. I think this is similar to something @eytan previously suggested.

Basically this makes it so that consumers of the interface can just wrap their component with

parametrizeComponent(experiment class, experiment name, React.createClass...)

and then the parametrization can be done without having to modify the relevant React component at all.
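
For example, usage would look roughly like this (the experiment, component, and import path here are made up, and the exact signature may still change):

import React from 'react';
import parametrizeComponent from 'react-experiments'; // export shape illustrative

// Hypothetical experiment that assigns a `buttonText` parameter.
import SignupButtonExperiment from './experiments/SignupButtonExperiment';

// The wrapped component just reads its parameters from props and knows
// nothing about the experiment itself.
const SignupButton = React.createClass({
  render() {
    return <button>{this.props.buttonText}</button>;
  }
});

export default parametrizeComponent(
  SignupButtonExperiment,    // experiment (or namespace) class
  'SignupButtonExperiment',  // experiment name
  SignupButton
);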

I think this is a better interface than the current method of using the Parametrize component with withExperimentParams or with context for cases where you just want to parametrize the props of a single component.

I think that the current method should be preferred when you want to implement an experiment that may deal with nested components.

Regardless, I just wanted to get this up for now as my laptop is going to die soon - going to put some more thought into the tradeoffs between the two different methods of using the parametrize component.

eytan commented 9 years ago

Why pass the experiment name in addition to the experiment class? It seems that it should be sufficient to simply pass down an instance of an Experiment object. Can you give an example of exactly how this changes the API's semantics?

rawls238 commented 9 years ago

@eytan It's necessary because of the design decision to support both namespace and experiment classes through the same interface. The experiment name is of course redundant if you pass an experiment class, but it is necessary if you pass a namespace, since we need to ensure that exposure gets logged properly. Each use of any of the components should refer to a single experiment, and since we have no context into what you're doing in the component, the only way we can ensure exposure is logged correctly is if you explicitly tell us the experiment name.

eytan commented 9 years ago

You should never need to reference an experiment name, even if using namespace. The PlanOut reference implementation of namespaces will log the experiment name, so perhaps I am misunderstanding something or this is a missing (but critical) feature from PlanOut.js' implementation of namespaces.

rawls238 commented 9 years ago

The scenario I'm referring to is effectively another manifestation of this issue: https://github.com/facebook/planout/issues/69. This library works with namespaces because in the JS implementation I added a getParams method to the namespace, which takes an experiment name and fetches the assigned experiment's parameters only if the current user is enrolled in the experiment whose name is passed into the function.

More concretely, suppose you have two experiments in a namespace, experiment A and experiment B. A given react-experiments component should refer to either experiment A or experiment B, but not both. Now suppose a user is enrolled in experiment B but a particular react-experiments component implements experiment A. The only way we know not to log exposure for this user (it would otherwise be logged as exposure to experiment B) is if the component tells us that it refers to experiment A. That way the component fetches the parameter values for experiment A only, and consequently logs exposure only if the user is in experiment A. Looking at the tests, I don't think I actually have a test for this particular case, and I probably should. Moreover, another issue that arises with namespaces is what the proper behavior should be when a user is not enrolled in the particular experiment that the component is implementing.
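
Roughly, the gating I'm describing looks like this (the helper name is made up; getParams is the method I added to PlanOut.js namespaces):

// Only fetch params (and thereby trigger exposure logging) when the
// user's assigned experiment matches the one this component implements.
function paramsForComponent(namespace, experimentName) {
  // getParams(experimentName) returns the assigned experiment's params
  // only if the current user is enrolled in `experimentName`, which is
  // what prevents exposure from being logged against the wrong experiment.
  const params = namespace.getParams(experimentName);

  // Open question from above: what to do when the user is not enrolled in
  // this experiment. Falling back to the component's default props is one option.
  return params || {};
}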

rawls238 commented 9 years ago

The other option is to replace experimentName with an array of the relevant experiment parameters. That way we could do type validation and use the get function to control exposure logging, which I think may be a better solution, but I'm curious what you guys think.

gusvargas commented 9 years ago

@eytan

I think the confusion over why we need an experiment name comes from a fundamentally different perspective on how Namespace should be used. We may be abusing Namespace to tailor it to our needs, and perhaps there is a better approach – I'd love to hear your thoughts.

From the PlanOut docs:

Namespaces are used to manage related experiments that manipulate the same parameter. These experiments might be run sequentially (over time) or in parallel.

In most cases, our pages do not generate enough traffic to allow us to launch multiple experiments that manipulate the same parameter (within a reasonable amount of time). Therefore, we utilize Namespace as a method of guaranteeing mutual exclusion between experiments that are ultimately trying to optimize some metric, e.g., signup conversion rate. This means that we may have multiple experiments within a Namespace that actually have no parameters in common. For example,

import SignUpCtaSizeOnLP from './experiments/SignUpCtaSizeOnLP';
import DifferentValueProposition from './experiments/DifferentValueProposition';

/*
  SignUpCtaSizeOnLP params: {
    size: oneOf(['small', 'medium', 'large'])
  }

  DifferentValueProposition params: {
    valueProposition: oneOf(['It will make your life better', 'Moar webscale'])
  }
*/

class SignupConversionNamespace extends Namespace {
  // ... snip ...

  setupExperiments() {
    this.addExperiment('SignUpCtaSizeOnLP', SignUpCtaSizeOnLP, 50);
    this.addExperiment('DifferentValueProposition', DifferentValueProposition, 50);
  }
}

Let's assume a user, John, is enrolled in this namespace and assigned to DifferentValueProposition.

In this case, if John hits the landing page and runs into a <Parametrize experiment={SignupConversionNamespace} /> component, then downstream a call to SignupConversionNamespace.getParams() will occur and cause exposure to be logged on DifferentValueProposition (the experiment he is assigned to) even though he didn't actually encounter the UI that we intended to "map" to DifferentValueProposition.

Both of these experiments share the same end goal and should be mutually exclusive, but they do not have any params in common and alter completely different pieces of the UI. It seems like this approach may be slightly bending the idea of what a Namespace was created for, and it therefore requires us to know which experiment within a namespace a wrapped <Parametrize />'d piece of UI is intended for.
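
Concretely, we're after something like this, where a hypothetical experimentName prop scopes each parametrized piece of UI to a single experiment in the namespace (the prop name, page components, and import paths are just for illustration):

import React from 'react';
import { Parametrize } from 'react-experiments'; // export shape illustrative
import SignupConversionNamespace from './SignupConversionNamespace'; // the namespace defined above
import SignUpCta from './SignUpCta'; // hypothetical landing-page CTA component
import ValueProposition from './ValueProposition'; // hypothetical signup-page copy component

// Landing page: exposure should only be logged if the user is actually
// in SignUpCtaSizeOnLP, not merely enrolled somewhere in the namespace.
const LandingPage = () => (
  <Parametrize
    experiment={SignupConversionNamespace}
    experimentName="SignUpCtaSizeOnLP">
    <SignUpCta />
  </Parametrize>
);

// Signup page: likewise scoped to DifferentValueProposition only.
const SignupPage = () => (
  <Parametrize
    experiment={SignupConversionNamespace}
    experimentName="DifferentValueProposition">
    <ValueProposition />
  </Parametrize>
);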

I'm curious to hear your thoughts on this approach and/or any ideas on how we can maintain mutual exclusivity amongst experiments with different params while solving the exposure logging problem.

/cc @mattrheault @rawls238

eytan commented 9 years ago

Hi @gusvargas --- I see, so this does sound like an abuse of namespaces :) Part of the design of namespaces is to make explicit when experiments need to be mutually exclusive: experiments should be mutually exclusive when they set the same variables in the codebase. In most other circumstances, though, mutual exclusion is not a concern --- even if your experiments are targeting the same outcome. Here are a few reasons why:

(1) Mutual exclusion doesn't buy you anything if there are no cross-experiment interaction effects: Consider the case of two A/B tests (E1 and E2), with no treatment/control interactions (e.g., the font size users saw on the landing page does not affect how users respond to the different value propositions on the signup page). Overlapping experiments would not cause bias, since (at least by default) assignments of the experiments are independent in PlanOut, and so users would be distributed randomly among all conditions. This would only very slightly increase variance (the width of your confidence intervals), even in the case of two 50/50 experiments.

(2) The amount of overlap between two experiments is often quite small, and so even in the presence of strong interaction effects, bias is limited. Consider two overlapping A/B tests where 10% are assigned to the treatment in each experiment. Then, WLOG, only 10% of the users in E1 will be in the treatment of E2. So even if the interaction effect were extremely strong --- say equal to the effect of E1 --- you would only have a bias of 10% in the direction of the interaction (a quick back-of-envelope version of this is sketched after point (4) below).

(3) In the presence of strong interactions, mutually exclusive experiments mask launch-relevant information:

(4) When E1 and E2 have strong and interactive effects and are large (so that the argument in (2) begins to fall apart), comparisons with the status quo are less relevant. Suppose you were in a state where you were very bullish on E1, such that you are running a 50/50 test for E1, and were also simultaneously testing E2 (in the same namespace). If we have gotten to this point, then E1 is a strong launch candidate, and we should be prepared for a world in which the test condition for E1 launched. Running E1 and E2 as mutually exclusive means that your baseline for E2 is necessarily the soon-to-be-obsolete status quo, whereas the treatment effects from an overlapping experiment would be weighted toward the options you are actively considering.
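
To put rough numbers on the overlap argument in (2) --- a minimal back-of-envelope sketch, with all quantities illustrative:

// Back-of-envelope version of (2): two overlapping A/B tests with 10%
// treatment allocation each and independent assignment.
const shareOfE1InE2Treatment = 0.10; // fraction of E1's users also in E2's treatment
const effectE1 = 1.0;                // normalize E1's true treatment effect to 1
const interaction = effectE1;        // worst case: interaction as large as E1's effect

// Only the overlapping 10% are affected, so the bias in E1's estimate is
// bounded by roughly 10% of the (already extreme) interaction effect.
const bias = shareOfE1InE2Treatment * interaction;
console.log(bias); // 0.1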

Finally, beyond the statistical concerns, unnecessarily trying to make experiments mutually exclusive makes it more difficult for teams to run simultaneous experiments, and can make a lot of aspects of running experiments more confusing and messy.

As an aside, namespaces aren't just for maintaining mutual exclusion -- they are also used to run follow-on experiments in cases where you want to consider additional variants / parameterizations without pushing code, so I think the way we intended them to be used is still relevant to organizations with limited numbers of users.

eytan commented 9 years ago

Also you might want to tinker with https://gist.github.com/eytan/03cfaf99203b8b73e367 to get an intuition for how these different factors affect bias and variance.

rawls238 commented 9 years ago

@eytan that certainly is an interesting way to look at it. It is obviously pointless to make experiments mutually exclusive when there is no possibility of interaction effects between them, and doing so definitely inhibits the ability to scale on an organizational level. We haven't been doing that; rather, we have been using namespaces both for follow-on experiments and for mutual exclusion between experiments where we perceive a non-zero probability of interaction effects (usually when they are in the same UI or the same UI flow).

Normally, to simplify the resulting analysis, we opt to run experiments in the same namespace when they aim to optimize the same metric and we think there may be interaction effects between them. This is simply so that we don't have to analyze each experiment conditional on the other experiments that are currently running and the different values they can take on. I always imagined that this doesn't scale when running many experiments with possible interaction effects, since for each experiment you would effectively have to track [# of possible treatments for the experiment in question] * [# of possible treatments for all other experiments with potential interaction effects] potential outcomes (e.g., with two experiments, Exp 1 - A / B and Exp 2 - C / D, analyzing Exp 2 means considering the potential outcomes AC, BC, AD, BD), and you would have to build out a rather complicated multilevel model to analyze even the simplest experiments. It always seemed simpler not to have to deal with this. That said, I've only been able to find good references / literature on interaction effects, and remedies for them, between experimental units, not between multiple experiments, so perhaps I'm off base here (which makes sense to me, since running multiple experiments in parallel seems rather unique to experimentation on web services) - hoping you have some good ones to send our way :). It also makes sense that at Facebook experiments already need to be heavily guarded against and analyzed for interaction effects between experimental units within a single experiment, since they run in a networked environment where the actions of some users can affect the behavior of other units, whereas we really don't have to deal with this for our products.

I also think this is a question of experimentation scale within an organization, as well as the resulting sample size of experiments - we usually have one, maybe two, people at a time implementing and running experiments to optimize a particular metric (e.g., viral invitations sent, activation rate, etc.). As a result we haven't really run into scenarios where this constrains our ability to iterate quickly. Also, most of our experiments can only gather samples in the low 1000's, given the time constraints we put on them in exchange for being able to iterate more quickly (which is rather unfortunate but a reality). At these lower sample sizes we would be more sensitive to the noise and bias introduced by running these experiments without mutual exclusion, since the added variance reduces the power of the experiment and makes erroneous conclusions more likely. Given that we don't feel constrained by only being able to run 2-4 experiments at a time focused on a single metric, it makes sense to try to eliminate bias and keep statistical power more predictable where possible, mainly for the sake of simplicity.

That being said... I never thought about point (4), but it certainly is a good point. I never considered that accepting bias might be OK because your control would necessarily become invalid as you productized one of your other experiments. This line of reasoning implies that it's actually important to know the interaction effects between experiments running at the same time, since not knowing them can render the other experiments effectively meaningless and force you to start them again against the new status quo. Do you view the primary benefit of mutual exclusion to be that it can prevent interaction effects that affect the user experience in very negative ways (e.g., black text on a black background)?

rawls238 commented 9 years ago

Also, regarding the preceding discussion as it relates to this PR, I think it makes sense to remove experimentName as a prop to parametrize completely and instead go forward with an array of params. This isn't strictly necessary (given the preceding discussion, it could be possible to simply have the experiment class as the only relevant prop / input), but I think it makes the code more semantic and more explicit to readers about which parameters parametrize is affecting, without having to refer back to the experiment definition. One could argue this makes it more bug-prone due to typos, but we can potentially sanity check that the parameters are valid using this from the PlanOut reference implementation: https://github.com/facebook/planout/pull/76/files if we see this becoming an issue in the future.
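
A rough sketch of the shape I have in mind, reusing the namespace and parameter from @gusvargas's example above (exact prop shape and import paths are illustrative):

import parametrizeComponent from 'react-experiments'; // export shape illustrative
import SignupConversionNamespace from './SignupConversionNamespace';
import ValueProposition from './ValueProposition'; // hypothetical wrapped component

// Instead of experimentName, the component declares the parameters it
// consumes; exposure logging can then be tied to get() calls on exactly
// these parameters, and the param names can be sanity checked against
// the experiment definition.
export default parametrizeComponent(
  SignupConversionNamespace,   // experiment or namespace class
  ['valueProposition'],        // params this component actually reads
  ValueProposition             // the React component being parametrized
);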

eytan commented 9 years ago

Thanks for the thoughtful response. I am definitely not advocating for the use of multi-level models or routinely testing experiments for interaction effects.

In general, though, the only time I think it's necessary to run mutually exclusive experiments is when the experiments set the exact same parameter. If you just want some kind of common label for experiments that are expected to influence an outcome, adding additional metadata to the experiment and logger might be the way to go.

I don't have a good sense of how strong the interaction effects are in your setting, but one highly effective way of decreasing your standard error is to run larger experiments, which is easier to do when experiments don't have to be mutually exclusive :)

wrt how we use namespaces, one of the most common uses of A/B tests is to compare ranking models. This can be seen as randomly assigning users to different versions of rankers, and these experiments are necessarily mutually exclusive because you can only sort content one way at a time. For my own work, I have been using namespaces to run adaptive experiments in which we incrementally roll out more optimal designs while continuing to run the initial, iid random designs.

rawls238 commented 9 years ago

Thanks @eytan, incredibly helpful. Going to think about this some more, but I may send you an email with some follow-up thoughts / questions. Merging this PR in for now.