pymc-labs / CausalPy

A Python package for causal inference in quasi-experimental settings
https://causalpy.readthedocs.io
Apache License 2.0

What regions to choose for single or multiple unit testing? #319

Open · drbenvincent opened this issue 7 months ago

drbenvincent commented 7 months ago

In the context of geo testing... Let's say we have historical data of some KPI (such as sales, or customer sign ups) across multiple geographies. And we are considering running some geo testing on a single (or multiple) regions. That is, we are planning on an intervention in one (or multiple) geographical regions. The question at the top of our minds is "what region (or regions) should we select to receive the intervention?"

This question has been tackled on this GeoLift docs page, which uses power analysis as the basic approach.

We can formalise this a bit as follows: How to best select n test regions out of the total pool of all regions?

The general approach we can use here is to iterate through all possible valid sets of test regions, evaluate some scoring function for each candidate set, and then pick the best. So something along the lines of:

from itertools import combinations

regions = ['Region1', 'Region2', 'Region3', 'Region4', 'Region5', 'Region6']

# Possible numbers of regions to test in
n_regions = [1, 2, 3]

# Generate all combinations of test regions
test_regions = []
for n in n_regions:
    test_regions.extend(combinations(regions, n))

That will give us a list which looks like this:

[('Region1',),
 ('Region2',),
 ('Region3',),
 ('Region4',),
 ('Region5',),
 ('Region6',),
 ('Region1', 'Region2'),
 ('Region1', 'Region3'),
 ('Region1', 'Region4'),
 ('Region1', 'Region5'),
 ('Region1', 'Region6'),
 ('Region2', 'Region3'),
 ('Region2', 'Region4'),
 ('Region2', 'Region5'),
 ('Region2', 'Region6'),
 ('Region3', 'Region4'),
 ('Region3', 'Region5'),
 ('Region3', 'Region6'),
 ('Region4', 'Region5'),
 ('Region4', 'Region6'),
 ('Region5', 'Region6'),
 ('Region1', 'Region2', 'Region3'),
 ('Region1', 'Region2', 'Region4'),
 ('Region1', 'Region2', 'Region5'),
 ('Region1', 'Region2', 'Region6'),
 ('Region1', 'Region3', 'Region4'),
 ('Region1', 'Region3', 'Region5'),
 ('Region1', 'Region3', 'Region6'),
 ('Region1', 'Region4', 'Region5'),
 ('Region1', 'Region4', 'Region6'),
 ('Region1', 'Region5', 'Region6'),
 ('Region2', 'Region3', 'Region4'),
 ('Region2', 'Region3', 'Region5'),
 ('Region2', 'Region3', 'Region6'),
 ('Region2', 'Region4', 'Region5'),
 ('Region2', 'Region4', 'Region6'),
 ('Region2', 'Region5', 'Region6'),
 ('Region3', 'Region4', 'Region5'),
 ('Region3', 'Region4', 'Region6'),
 ('Region3', 'Region5', 'Region6'),
 ('Region4', 'Region5', 'Region6')]

Let's continue with our approach:

# Example scoring function
def scoring_function(test_regions, remaining_regions):
    # Dummy implementation of the scoring function
    return len(test_regions) - len(remaining_regions)

# Store scores for each combination
scores = {}

# Iterate over each combination
for test_combo in test_regions:
    # Remaining regions are those not in the current combination
    control_regions = set(regions) - set(test_combo)
    # Evaluate the scoring function
    score = scoring_function(test_combo, control_regions)
    # Store the score with the combination as the key
    scores[test_combo] = score
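
Picking the winner is then just an argmax over the scores (assuming higher scores mean better):

# Select the best-scoring combination of test regions
best_combo = max(scores, key=scores.get)
print(f"Best test regions: {best_combo} (score = {scores[best_combo]})")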

So the question is, what happens in scoring_function? Well that depends on what we want to do, but it will be along the lines of:

Note: Running synthetic control with multiple treatment regions would benefit from having addressed https://github.com/pymc-labs/CausalPy/issues/320 first.
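
Purely as an illustrative sketch (not a definitive implementation), one possibility is to score a candidate test set by how well the remaining control regions predict it in the pre-intervention data, so that better-explained test regions score higher. Here the signature also takes a hypothetical pre_period_df with one column of pre-intervention KPI values per region:

import numpy as np
from sklearn.linear_model import LinearRegression

def scoring_function(test_regions, control_regions, pre_period_df):
    # Average in-sample R^2 when predicting each test region's
    # pre-intervention KPI from the control regions
    X = pre_period_df[list(control_regions)].values
    r2_values = []
    for region in test_regions:
        y = pre_period_df[region].values
        r2_values.append(LinearRegression().fit(X, y).score(X, y))
    return float(np.mean(r2_values))

Other ingredients, such as power analysis or the cost of intervening in each region, could be folded into the same score.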


All of this should be wrapped into a function (or class) called something like FindBestTestRegions, with a signature along the lines of:

Input:

Output:

cetagostini commented 5 months ago

Hey team, this topic is super interesting πŸ”₯

First, I have a personal opinion regarding the Meta approach, because I feel it can be improved. Many times, knowing where you can execute an intervention is useless. Users don't want to experiment for the sake of experimenting; they want to intervene in regions/geos where they can derive information.

Example: Let's imagine I'm running a mobility company that operates globally. I decide to use the GeoLift package's MultiCellMarketSelection feature to identify the best location for a marketing intervention. The tool suggests the Shetland Islands as the ideal place to run a test. After conducting the test, I find that our marketing campaigns perform exceptionally well there. As a result, I replicate the same strategies in other regions in the UK. Surprisingly, the outcome is not as successful despite investing more resources. Could it be that the initial test results were misleading?

The test never lied, but we analyzed a geography with totally different patterns from the other geographies that really interest us. If the geography where we can intervene is not of interest to us, or is too different from the place where we really dedicate 90% of our budget, then even if it is the best place to experiment, why should we do it there?

We can extrapolate this example to more than one region. What if the pool of intervention regions is not relevant?

Based on my experience, all marketers have a clear idea of where they want to test new strategies. We aim to gather insights from London or Manchester, where the majority of our resources are concentrated, to better understand local consumer behavior, as this is where 80% of our impact lies. The challenge lies in identifying the most suitable geographic locations to accurately represent my target area and minimize the MDE in my experimental methodology.

What if we turned the question around and asked, "If I want to conduct an experiment in geo X (London), what would be the optimal control geo (city) to select?", where our objective is to increase the sensitivity of the experiment (i.e. reduce the MDE)?

We can certainly train several models estimating the power or minimum detectable effect for each combination of predictors and target variables. This process will provide us with two potential courses of action:

  1. We train different models with different cities as controls and select the one where the posterior is narrowest. This should mean a lower MDE.
  2. We train different models with different cities as controls, adding a simulated effect, and see for which models that effect falls outside our "ROPE" threshold, picking the one that captures the smallest effects.

Based on my experience, a low MDE usually corresponds to a low MAPE or a high R2. With this in mind, optimizing these metrics typically results in a good MDE. When it comes to optimizing R2 or MAPE in our regression, we have multiple options to consider.
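
One simple option along those lines, sketched here under assumptions (rank_control_sets is a hypothetical helper and pre_period_df holds pre-intervention KPIs with one column per geo), is to brute-force candidate control sets for the target geo and rank them by pre-period MAPE:

from itertools import combinations

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error

def rank_control_sets(pre_period_df, target_geo, n_controls=3):
    # Rank candidate control sets by in-sample MAPE when predicting the
    # target geo's pre-intervention KPI (lower MAPE ranked first)
    candidates = [c for c in pre_period_df.columns if c != target_geo]
    y = pre_period_df[target_geo].values
    results = []
    for combo in combinations(candidates, n_controls):
        X = pre_period_df[list(combo)].values
        y_hat = LinearRegression().fit(X, y).predict(X)
        results.append((combo, mean_absolute_percentage_error(y, y_hat)))
    return sorted(results, key=lambda item: item[1])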

Lasso for Control Selection

I have been using the Lasso technique to select the best regressors from a pool. Because Lasso regression tends to shrink the coefficients of less important variables to zero during model fitting, we can use it to measure feature relevance. Additionally, Lasso is easy to implement within the Bayesian framework.

Here is an example of how we could apply it, using our synthetic control dataset as a demo!

As I've been explaining: let's suppose we aim to conduct our experiment in the 'actuals' geography. By using the Lasso approach, we can see which alternative locations account for more of the variance in our target, which ultimately contributes to better counterfactual generation.

import arviz as az
import causalpy as cp
import pymc as pm
from sklearn.preprocessing import StandardScaler

df = cp.load_data("sc")
actual_treatment_time = 70
fake_treatment_time = 60 
power_df = df[:actual_treatment_time].copy()
X = power_df.drop(columns=['actual', 'counterfactual', 'causal effect']).values
y = power_df['actual'].values

# Standardize the predictors
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Number of predictors
n_predictors = X_scaled.shape[1]

# Hyperparameters
lambda_param = 1  # Regularization parameter, can be optimized or tuned

with pm.Model() as model:
    # Define priors for the intercept and coefficients
    intercept = pm.Normal('intercept', mu=0, sigma=10)
    betas = pm.Laplace('betas', mu=0, b=lambda_param, shape=n_predictors)

    # Linear model
    mu = intercept + pm.math.dot(X_scaled, betas)

    # Likelihood
    sigma = pm.HalfNormal('sigma', sigma=1)
    likelihood = pm.Normal('y', mu=mu, sigma=sigma, observed=y)

    # Posterior sampling
    idata = pm.sample(2000, tune=1000, random_seed=42)

# Extracting the beta coefficients
betas_trace = idata.posterior['betas']

# Mean of the betas across samples
betas_mean = betas_trace.mean(dim=['chain', 'draw']).values

# Determine which coefficients are significantly non-zero
credible_intervals = az.hdi(idata, hdi_prob=0.95)

# Extract the 'betas' variable from the credible intervals dataset
betas_credible_intervals = credible_intervals['betas']

# Loop through each beta coefficient's indices to check their credible intervals
selected_columns = []
columns = df.drop(columns=['actual', 'counterfactual', 'causal effect']).columns

for idx, column in enumerate(columns):
    lower, upper = betas_credible_intervals.sel(betas_dim_0=idx).values
    if not (lower <= 0 <= upper):
        selected_columns.append(column)

print(f"Selected Columns: {selected_columns}")

With this application, we can determine the best geos for estimating our target ('actuals'). Those are a, c, and f. This means that these geographies would be the best to leave as controls if we need to intervene in 'actuals'.

Does this really work?

  1. If we compare the fit of the model, before the intervention, using all the regressors or using only those three, we can see that the estimate is almost the same. Consequently the power should be almost equal.

Using a, c, and f only:

[Figure: pre-intervention model fit using only a, c, and f]

Using all:

[Figure: pre-intervention model fit using all regressors]

  2. Additionally, you can see that the amount of effect detected with only these three regressors is almost the same as that detected when using all the regressors.

[Figure: detected effect using the three selected regressors vs. all regressors]

Conclusion

We can create a method where the user defines only the geos of the dataset where they are interested in intervening, e.g. intervention_geos=[a, b, c, ...]. Then we execute the Lasso selection for each one, excluding the geos of interest from the pool as well. As a result, the user will know which regions to leave as controls if they want to run a test in regions A, B, C.

By doing this we could:

  1. Simplify the process: we do not need to define a scoring function, and we can use more traditional yet sufficiently effective methods.
  2. Make code execution more efficient, because we would need fewer loops.
  3. Add clarity to the problem statement around the experiments.
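
A hedged sketch of the interface proposed above, where the user supplies intervention_geos and we run the Lasso selection per geo (the names find_best_control_regions and select_controls_via_lasso are hypothetical, and a plain scikit-learn LassoCV stands in here for the Bayesian Laplace-prior model shown earlier):

from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

def select_controls_via_lasso(pre_period_df, target_geo, candidate_controls):
    # Return candidate control geos whose Lasso coefficients are non-zero
    # when predicting the target geo from pre-intervention data
    X = StandardScaler().fit_transform(pre_period_df[candidate_controls].values)
    y = pre_period_df[target_geo].values
    lasso = LassoCV(cv=5).fit(X, y)
    return [geo for geo, coef in zip(candidate_controls, lasso.coef_) if abs(coef) > 1e-6]

def find_best_control_regions(pre_period_df, intervention_geos):
    # For each geo we want to intervene in, suggest control geos,
    # excluding all intervention geos from the candidate pool
    candidates = [c for c in pre_period_df.columns if c not in intervention_geos]
    return {geo: select_controls_via_lasso(pre_period_df, geo, candidates)
            for geo in intervention_geos}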

This is just my opinion (happy to be challenged), and a little context on how I have approached the problem and the issues I've experienced when using the models from Meta. I would love to hear your thoughts and know if I'm crazy, or maybe there is some sense to it! 🔥

drbenvincent commented 5 months ago

Awesome thoughts @cetagostini. Just a few quick responses.

Both approaches?

I think it makes a lot of sense to go after both approaches. So we'd end up having something like:

But because I don't have experience of applying this in real situations... do we even need to do this? If we know what test regions to run, then all other regions are by default control regions. So can't we just run our test in the geos that we want to, then run regular synthetic control after the fact, and that will choose the set of control geo weightings? The only difference is that we've not decided those weightings (or which geos will be weighted at zero) in advance of running the test. [Also see next point]

I have had questions from a client who genuinely wants to know what the best test regions are. This is a very general problem, so this doesn't give anything away...

Geos have a distribution of market sizes: some are big, some are small. Running a test in a large region would cost a lot of money, so we want to test in region(s) which are smaller, to fit within our budget. But we also don't want to pick regions which we know are not representative, because we might not be able to generalise the causal impact from the test regions to other regions.

So I don't think we should throw out the idea of FindBestTestRegions functionality; it's just that the way it's done needs to be well thought out. For example, geo attributes (like population size, marketing spend, geographical distance, etc.) could be factored in. The cost of an intervention in each region could also be factored in.

Sparsity

The LASSO idea is an interesting one. There could be some slight complications in using LASSO for geo selection but then ending up analysing with synthetic control. If that does turn out to be problematic, then we could consider altering the hyperparameters of the Dirichlet distribution so as to favour fewer, more strongly weighted regions. So basically take that idea of sparsity from LASSO but implement it in the synthetic control WeightedSumFitter model class.
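
A minimal sketch of that sparsity idea (standalone, not the actual WeightedSumFitter code): a Dirichlet concentration below 1 pushes most of the weight onto a few control geos.

import numpy as np
import pymc as pm

# Toy pre-intervention data: 5 control geos, 70 time points
rng = np.random.default_rng(0)
X_pre = rng.normal(size=(70, 5))
y_pre = X_pre @ np.array([0.7, 0.3, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=70)

with pm.Model() as sparse_synthetic_control:
    # Concentration parameters < 1 favour sparse weight vectors
    weights = pm.Dirichlet("weights", a=np.full(X_pre.shape[1], 0.5))
    sigma = pm.HalfNormal("sigma", sigma=1)
    mu = pm.math.dot(X_pre, weights)
    pm.Normal("y", mu=mu, sigma=sigma, observed=y_pre)
    idata = pm.sample(1000, tune=1000, random_seed=42)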

Question

I don't have experience of actually using these methods in the real world. It seems like it's 'easier' to calculate the posterior in the situation of no effect, in order to calculate a ROPE and make statements about minimal detectable effects, etc. But when it comes to simulating effect sizes, this seems more complex. Should you simulate an immediate and sustained impact, or consider some other time course of impact? We don't know what the time course of the impacts is, so this seems more tricky to get right.

cetagostini commented 5 months ago

Hey @drbenvincent

Agree we can have both; maybe I'm very biased by my usual work use cases! 🙌🏻

But because I don't have experience of applying this in real situations... do we even need to do this? If we know what test regions to run, then all other regions are by default control regions. So can't we just run our test in the geos that we want to, then run regular synthetic control after the fact, and that will choose the set of control geo weightings? The only difference is that we've not decided those weightings (or which geos will be weighted at zero) in advance of running the test. [Also see next point]

Very valid point, I have been there. To add context: you'd be assuming that using all the geos, or some other subset, would give us a good enough model. What happens is that sometimes, throwing all the elements into the blender and pressing the button gives a very poor result.

If we execute an activity in Y blindly, without first verifying which other regions could serve as controls, we could end up in a situation where, after the test, you want to use regions X and Z as controls but they give a poor counterfactual with a high MDE, which ultimately leaves your experiment with little significance.

If we had known this beforehand, we could have restructured the experiment, or simply looked for other regions. I think that is the value of a FindBestControlRegions function. However, now that I write it down, it looks more related to the planning stage than a methodology in itself 😅

Geos have a distribution of market sizes: some are big, some are small. Running a test in a large region would cost a lot of money, so we want to test in region(s) which are smaller, to fit within our budget. But we also don't want to pick regions which we know are not representative, because we might not be able to generalise the causal impact from the test regions to other regions.

This is a pretty interesting case, and I think it proves my point a bit. They know exactly which regions they want to use for tests (the cheap regions, because they have a small budget), but even within those regions they need to make a sub-selection!

Thanks for sharing the context, definitely can imagine FindBestTestRegions here!

Sparsity

By issues, do you mean that Lasso can zero out some predictor coefficients, which might translate into zero weights for some regions in the synthetic control? Just to be sure we are thinking the same thing!

If that does turn out to be problematic, then we could consider altering the hyperparameters of the Dirichlet distribution so as to favour fewer, more strongly weighted regions.

I like this idea. Do you think we'll end up with two variants of the WeightedSumFitter? A Lasso-based one and a Dirichlet-based one (similar to the current implementation)?

Simulated Effect

I don't have experience of actually using these methods in the real world. It seems like it's 'easier' to calculate the posterior in the situation of no effect, in order to calculate a ROPE and make statements about minimal detectable effects, etc. But when it comes to simulating effect sizes, this seems more complex. Should you simulate an immediate and sustained impact, or consider some other time course of impact? We don't know what the time course of the impacts is, so this seems more tricky to get right.

Great question. From my experience, ROPE indeed makes things simple. That's why I like it 😅

However, here I have no strong opinion about what could be better. Sometimes, depending on my assumptions and mood, if I think the effect will be continuous then I simulate it as constant and immediate over a fake period with no effect and see if the model recovers it. I wrote a package doing this a year or two ago.

But since that is a very naive approach, and sometimes I assume the effect could vary over time, I also have a function that generates a random walk time series of a given length, ensuring that the arithmetic mean of the time series equals a specified target mean (the simulated mean effect).

import numpy as np

def generate_simulated_random_walk_impact(length, mean_effect):
    # Generate standard normal steps
    steps = np.random.normal(loc=0, scale=1.0, size=length)
    # Create the random walk by taking the cumulative sum of the steps
    random_walk = np.cumsum(steps)
    # Adjust the random walk to achieve the target mean
    current_mean = np.mean(random_walk)
    adjustment = mean_effect - current_mean
    adjusted_random_walk = random_walk + adjustment
    return adjusted_random_walk
[Figure: examples of simulated random-walk impact series]

Those random walks, with that mean X, represent your effect and are multiplied over a period of the same length where no effect exists. Usually you can set mean_effect=1.2 to indicate a 20% lift, for example.
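
For example, a sketch (y_no_effect and the 20% lift are illustrative assumptions):

import numpy as np

# Observed KPI over a fake post-period with no real effect (illustrative values)
y_no_effect = np.array([100.0, 102.0, 98.0, 101.0, 99.0, 103.0])

# Simulate a ~20% average lift whose magnitude wanders over time
impact = generate_simulated_random_walk_impact(len(y_no_effect), mean_effect=1.2)
y_with_simulated_effect = y_no_effect * impact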

Once you have those different time series with different patterns but the same total effect, you can compute the distribution of your daily or cumulative value (you decide), and once you have the posterior of your model, you can compare it with the distribution of your effect. If the posterior and the effect's distribution are clearly different, it indicates that your model can detect an effect of that size.

You can actually build a score around this, computing the difference between the distributions, and finally determine the MDE.
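
A hedged sketch of one such score, assuming we have posterior draws of the average effect and draws of the simulated effect: the overlap of the two empirical distributions, where a smaller overlap means the effect is easier to detect.

import numpy as np

def distribution_overlap(posterior_draws, simulated_effect_draws, n_bins=50):
    # Approximate overlap coefficient between two empirical distributions
    # (1 = indistinguishable, 0 = fully separated)
    lo = min(posterior_draws.min(), simulated_effect_draws.min())
    hi = max(posterior_draws.max(), simulated_effect_draws.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    hist_a, _ = np.histogram(posterior_draws, bins=bins, density=True)
    hist_b, _ = np.histogram(simulated_effect_draws, bins=bins, density=True)
    return float(np.sum(np.minimum(hist_a, hist_b)) * (bins[1] - bins[0]))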

However, in the end that whole process is not much different than using ROPE during a period of no effect, and ROPE is much shorter and simpler. But would it still be worth exploring?

drbenvincent commented 5 months ago

I think at this point I'm yet to be convinced how useful FindBestControlRegions would be. In my mind, the weighted sum synthetic control model does the job of choosing weightings of all non-test regions. In the current implementation it does this only on the basis of 'fitting' the test region(s) in the pre-treatment era, but there's no reason why we can't incorporate other factors. For example, you might want to bias weights higher for regions which have attributes more similar to the test region(s). Maybe there is value in selecting a set of control regions a priori (equivalent to setting the weightings of the remaining regions to zero) before the intervention, but I think I still need a bit of convincing.

In terms of sparsity of the synthetic control, I think the simplest way would be just to allow the user to specify the hyperparameters of the Dirichlet distribution in WeightedSumFitter. That said, there are alternative synthetic control modelling approaches (e.g. not constraining the sum of weights to equal 1), and it's very likely that we will implement some of those at some point.