Crickets on this one? I guess maybe it's square peg round hole...
Sorry for the crickets, @djMax.
You are not the only one interested in a system to solve this problem. The Wasabi system was developed (and, we think, works pretty well) for traditional A/B testing: you have a control bucket and one or two experimental buckets, you put some people into the experimental buckets, and you compare their success with the tested feature against the control (or against each other, if it is A/B/N testing). If you are recording impressions and a single success action, we even give you graphs in the UI and will attempt to let you know when one of the buckets has reached a statistically significant result.
However, that doesn't mean this scenario is invalid. In fact, as I mentioned above, it seems to be exactly the second thing people want to do once they have done a traditional A/B test (and sometimes the first thing). One of the obstacles to using Wasabi this way is actually a feature of Wasabi. If you are not using a 100% sampling percentage, the way you want a normal test to work is: if someone is in the test, they get put in a bucket and that assignment is "sticky". That is, when they come back and your code asks for their assignment again, you will get back the same answer. But it is also a goal that people who weren't put in the test, i.e., people who will be getting the default, non-test experience, will get that same experience the next time they come back. For that reason, people who are not in the test due to the sampling percentage are actually given a bucket assignment too, to the "null" bucket. That way, when they come back, they will always get the default experience.
However, that doesn't work when you are trying to do a controlled rollout using Wasabi. But it turns out there IS a way to do this.
Basically, the idea is that you create an experiment and set its sampling percentage to 100%. You then create (at least) two buckets, one called "control" and the other called "test" (or whatever makes sense to you), and set their allocation percentages to your initial rollout numbers, e.g., 90% for control and 10% for test. In your code, you then check the user's assignment for this experiment: if the user is assigned to the "test" bucket, you show them your test experience; if they get ANYTHING ELSE, you give them the default experience. Once you have started the experiment and deployed the app, you will start showing the test experience to 10% of the traffic, and the other 90% will be assigned to the control bucket and get the default experience.
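Here's roughly what that check could look like from a service. This is a minimal sketch, not a definitive implementation: it assumes the assignment endpoint has the usual `GET /api/v1/assignments/applications/{app}/experiments/{label}/users/{user}` shape and returns the bucket label in an `assignment` field (check the Swagger docs of your Wasabi deployment), and the host, application, and experiment names are made up:

```python
import requests

WASABI_BASE = "http://localhost:8080/api/v1"        # your Wasabi host
APP, EXPERIMENT = "MyApp", "feature_x_rollout"      # hypothetical names

def sees_test_experience(user_id: str) -> bool:
    """True only when the user is assigned to the "test" bucket.

    Anything else -- "control", a later "control2", a null assignment,
    or an error -- falls through to the default experience.
    """
    url = (f"{WASABI_BASE}/assignments/applications/{APP}"
           f"/experiments/{EXPERIMENT}/users/{user_id}")
    try:
        resp = requests.get(url, timeout=2)
        resp.raise_for_status()
        return resp.json().get("assignment") == "test"
    except requests.RequestException:
        return False  # fail closed: show the default experience
```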
Next, you want to increase the rollout of your test experience. To do this, you go into the Wasabi Admin UI and edit your experiment. You click on the Edit Buckets link on the Buckets tab, read the important warning (which you ignore), and click on the OK button. You now see the Edit Buckets dialog, where each bucket has a Close button and an Empty button. If you want, you can read the help reached by clicking on the ? icon next to the EMPTY heading, but basically, you want to empty the "control" bucket by clicking on the trash can icon and saying OK to the warning dialog. This will cause everyone in the "control" bucket to now have NO assignment.

Next, you create a new bucket, call it "control2" or whatever, and give it the new rollout percentage. For example, if you want to increase the percentage of your traffic that will see your test experience from 10% to 30%, make the percentage for "control2" be 70%. When you save that, the percentage for "test" will be changed to 30%, and the experiment will now start handing out "test" to 30% of the users and "control2" to 70% of the users. These numbers only apply from that point on; that is, those who already have "test" will still have it, but those who had "control" will now get either "control2" or "test", depending on the dice roll determining if they are in the 70% or the 30%.
Hopefully that was clear and will meet your needs. Let us know if you still have questions or feature suggestions.
Thanks for this - very informative. It does mean, I think, that you can only REDUCE the rollout by the size of a bucket. I.e., you might want five 5% buckets to get to 25% so you can remove one to get to 20%. I suppose the alternative is to use the payload to control what a bucket means, i.e., make 20 buckets of 5% and just change the meaning to go in or out of the test. In both these cases it makes outcome tracking hard, but I suppose that's asking to have my cake and eat it too.

But another question: we have millions of users. Will that bucket deletion cause problems?
I guess I might not fully understand your use case. What most people who've wanted something like this wanted was a way to control and INCREASE rollout. That is, start with 5%, then 10%, and eventually all users. Is that what you want to achieve?
In order to do that, you only really need two active buckets. One is "Control", i.e., "people not in the test"; the other can be called "Test". Your code won't be looking for people in the Control bucket, only for people in the Test bucket. That is, when the code gets the response from the assignment API, it checks whether the user was assigned to the Test bucket. If so, they get the test experience; if not, they just go on with the default experience. It doesn't matter what bucket the non-Test people are in, just that they aren't in the Test bucket.

When you increase the rollout, you throw everyone out of the Control bucket, create a new bucket called Control2 at a smaller allocation percentage (for my example above, changing from 5% to 10%, you would make the allocation percentage of Control2 be 90%), and also increase the allocation of the Test bucket. That means the people already assigned to Test still get Test, and a larger percentage of the people being assigned to the experiment in the future will get Test.
Note that what you're really doing is changing the probability that new, unassigned users will be assigned to Test. You are not actually "assigning a given percentage of users to the test". If you aren't trying to implement phased rollout with Wasabi (which is what we're discussing here), those two things are equivalent: if you have a 5% allocation to a bucket and never change it, then 5% of the users who come to your site will be put in that bucket, and if all of your users eventually come to the site, 5% of your entire user base will be in the test. But things get a lot fuzzier when you try to use Wasabi for phased rollout control. You can get something like that effect through the technique I have described, but you can't really get a quantified "5% of my users today, 10% of my users tomorrow" effect.
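To make that fuzziness concrete, here is a toy simulation (plain Python, nothing Wasabi-specific) of the empty-and-re-roll flow from my example above, under the simplifying assumption that every previously assigned user comes back and gets re-assigned:

```python
import random

random.seed(1)
N = 100_000

# Phase 1: sticky assignments at 5% "test" / 95% "control".
assignment = ["test" if random.random() < 0.05 else "control" for _ in range(N)]

# The operator empties the "control" bucket: those users lose their assignment.
assignment = [a if a == "test" else None for a in assignment]

# Phase 2: allocation is now 10% "test" / 90% "control2". Returning users with
# no assignment get re-rolled; existing "test" users keep "test".
assignment = [a if a is not None
              else ("test" if random.random() < 0.10 else "control2")
              for a in assignment]

print(f"fraction now seeing the test experience: {assignment.count('test') / N:.1%}")
# Prints roughly 14.5%, not 10%: the original 5% keep "test", and 10% of the
# re-rolled 95% land in "test" on top of that.
```

In other words, the users who already have Test stack on top of the new allocation, so the overall exposure lands above the percentage you set.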
BTW, I was discussing this with one of the other engineers, and he actually addressed your last question. In fact, it is very likely that if you have an experiment with millions of users assigned to the Control bucket and you then empty the Control bucket, you will take your Cassandra servers down. ☹️ I suppose it is theoretically possible you could create a beefy enough set of servers that they could handle that, but it is definitely an issue.
Sometimes a feature rollout could go wrong, and you'd want to reduce the number of people receiving the feature. So it sounds like, from what you're saying, even this option isn't workable with Wasabi. It wouldn't really make sense to build servers beefy enough to handle repeated 10-million-plus deletes just to work around a "design mismatch..."

I think the other option is to fully partition the set up front, i.e., make twenty 5% buckets, and then use the payload to define what each bucket means. Then when you want to modify the rollout %, you just modify the payload of the relevant slices. Would that work?
Actually, I guess so! I hadn't thought of that, but it should work. All users will get distributed among the 20 buckets and then you can just turn on or off your feature by modifying the payload. Interesting solution.
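To sketch what the client check might look like with that approach (same caveats as before on the endpoint shape; I'm also assuming the bucket payload comes back as a JSON-encoded string in a `payload` field, and the `{"enabled": ...}` convention is just one you'd invent):

```python
import json
import requests

WASABI_BASE = "http://localhost:8080/api/v1"      # your Wasabi host
APP, EXPERIMENT = "MyApp", "feature_x_rollout"    # hypothetical names

def feature_enabled(user_id: str) -> bool:
    """Gate on the assigned bucket's payload instead of its label.

    All twenty 5% buckets stay populated forever; you flip a slice in or
    out of the rollout by editing that bucket's payload (e.g., between
    '{"enabled": true}' and '{"enabled": false}'), never by emptying it.
    """
    url = (f"{WASABI_BASE}/assignments/applications/{APP}"
           f"/experiments/{EXPERIMENT}/users/{user_id}")
    resp = requests.get(url, timeout=2)
    resp.raise_for_status()
    payload = resp.json().get("payload")          # payload is a string
    if not payload:
        return False
    return bool(json.loads(payload).get("enabled", False))
```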
We also might look into what it would take to do the emptying of buckets in a different way that wouldn't have the potential of taking the system down.
Great to hear there might be some movement here. What I'm doing at the moment is actually fronting Wasabi with my own service. If an experiment is a "feature flag," I'm going to treat it differently: I want to mark the experiment in such a way that it's a "non-assigned experiment." For these, I will just pull the bucket information and use a mod/ring-buffer approach to uniquely assign each visitor to a group, but not record that assignment in Wasabi. Then I can flex up and down at will by just modifying the buckets. A weird use of Wasabi, I understand (rough sketch below).
My question... Is there a reasonable way to "tag" an experiment in some way? Experiments don't seem to have payloads like buckets do.
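For reference, here's roughly what I mean by the mod/ring assignment. It's a rough sketch that lives entirely in my fronting service, outside Wasabi, and the function names are made up:

```python
import hashlib

def slice_for(user_id: str, experiment: str, slices: int = 100) -> int:
    """Deterministically map a user to one of `slices` slots.

    Hashing instead of random assignment keeps the mapping stable across
    visits without recording anything in Wasabi.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % slices

def in_rollout(user_id: str, experiment: str, percent: int) -> bool:
    # Flexing up keeps existing users in (slot 7 is in at 10% and still
    # in at 30%); flexing down removes the highest slots first.
    return slice_for(user_id, experiment) < percent
```

Because the mapping is deterministic per user, raising the percentage never kicks anyone out, and nothing has to be stored or deleted anywhere.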
I understand generally how the bucket assignment stuff works, I think, but I wonder how one handles a test that morphs over time. So let's say I want to rollout feature X slowly to my user base. I make an experiment that assigns 5% of the users to the new feature. Should this be a sample rate of 5%, or a sample rate of 100% with a bucket at 5% and a bucket at 95%?
Now, I think things are going well, so I want to up the percentage of people getting the feature to 10%. What's implied here, IMHO, is that the percentage going forward should be 10%. So the 5% that were in it before should still be in it. And 5% more, INCLUDING from the set of people NOT in the feature bucket before, should now get in the bucket. I don't see how one can do this with the current infrastructure. I think if, instead of EXISTING_ASSIGNMENT, the system returned the time of the assignment (or some similar serial number), AND there was some way to delete an assignment, I could make it work.
And finally, let's say it all goes to heck and I want to give 0% of the users the feature. Would I have to delete all the bucket assignments?
But perhaps I'm thinking about it wrong. I do know that we want a tool that does both A/B testing and feature gating, because so much of those two things are shared concerns. Happy to muck with code to get there, but I want to make sure I'm not missing something.