pymc-labs / CausalPy

A Python package for causal inference in quasi-experimental settings
https://causalpy.readthedocs.io
Apache License 2.0

Add new multivariate interrupted time series functionality #357

Open drbenvincent opened 1 week ago

drbenvincent commented 1 week ago

Proposal

Typically, interrupted time series (ITS) designs are univariate, in that there is a single outcome variable. An existing example in the docs examines the causal impact of the onset of COVID-19 upon the univariate outcome measure of excess deaths.

However, there are plenty of scenarios where we might want to consider multivariate outcomes. Examples might include:

- a marketing intervention that affects the sales of multiple products
- a public health measure that affects multiple health outcomes

The simplest approach would be to run multiple ITS models, assessing the causal impact of the intervention upon each outcome independently. However, the main limitation is that this assumes the outcome measures are independent, so it fails to capture joint effects. We would also have to run multiple analyses rather than just one.
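To make the naive approach concrete, here is a minimal sketch (not CausalPy code; all variable names are illustrative) that fits an independent least-squares trend to the pre-intervention period of each outcome and extrapolates a per-outcome counterfactual:

```python
import numpy as np

# Hypothetical illustration of the naive approach: fit a separate
# least-squares trend model to the pre-intervention period of each
# outcome, then extrapolate a counterfactual into the post period.
rng = np.random.default_rng(42)
n, treat_idx = 100, 70
t = np.arange(n)
X = np.column_stack([np.ones(n), t])          # intercept + linear trend
Y = np.column_stack([5 + 0.1 * t, 3 + 0.05 * t, 2 - 0.02 * t])
Y = Y + rng.normal(0, 0.5, size=Y.shape)      # three noisy outcomes

counterfactuals = []
for j in range(Y.shape[1]):
    # Each outcome gets its own independent fit -- any correlation
    # between outcomes is ignored entirely, which is the limitation.
    beta, *_ = np.linalg.lstsq(X[:treat_idx], Y[:treat_idx, j], rcond=None)
    counterfactuals.append(X[treat_idx:] @ beta)

counterfactuals = np.column_stack(counterfactuals)
print(counterfactuals.shape)  # one counterfactual series per outcome
```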

Boundary conditions of when to use this approach

The multivariate outcome approach is particularly useful when we do not want to posit a more complex causal structure between the different outcome measures. A marketing intervention assessed by its impact upon the sales of multiple products is perhaps a good example of where multivariate ITS could be useful: the intervention has widespread impacts upon the sales of multiple products, but there isn't (or we want to remain agnostic about) a complex causal chain of influence between the products.

However, if the outcomes are less similar, for example if the intervention is a public health measure, then it might be possible to specify more complex structural models that describe the causal relationships between the various measures.

General approach

The schematic shows roughly what I'm proposing. We have a number of outcome time series and a single point where an intervention takes place (though this could be relaxed). We can use all available current ITS modelling approaches via the model formula approach. We'd then simply generate a counterfactual prediction for each time series.

[schematic sketch: multiple outcome time series sharing a single intervention point, each with its own counterfactual prediction]

We could either model the residual variance with individual sigma parameters per series, or we could use a full covariance matrix to capture correlations between the outcomes.
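A rough sketch of the difference between the two options, using empirical estimates rather than a fitted Bayesian model (purely illustrative): with correlated residuals, per-series sigmas discard exactly the off-diagonal structure that a full covariance matrix retains.

```python
import numpy as np

# Simulate correlated residuals across three outcome series.
rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.5, 0.3],
                     [0.5, 1.0, 0.4],
                     [0.3, 0.4, 1.0]])
resid = rng.multivariate_normal(np.zeros(3), true_cov, size=5000)

# Option 1: an independent sigma per outcome (diagonal information only)
sigmas = resid.std(axis=0)

# Option 2: a full covariance matrix (also captures cross-series correlation)
cov = np.cov(resid, rowvar=False)

print(np.round(sigmas, 2))
print(np.round(cov, 2))
```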

API

We'd stick relatively closely to the current API. The main difference would be that we specify a list of model formulas:

```python
result = cp.pymc_experiments.InterruptedTimeSeries(
    df,
    treatment_time,
    formula=["prod1_sales ~ 1 + t + C(month)",
             "prod2_sales ~ 1 + t + C(month)",
             "prod3_sales ~ 1 + t + C(month)"],
    model=cp.pymc_models.LinearRegression(sample_kwargs={"random_seed": seed}),
)
```

There would be no requirement for the RHS of the formulas to be the same, and we may also want to pass in a list of models.

EDIT: If we did want a single model to be applied to all time series, we could perhaps use formulae. @tomicapretto mentions:

> In formulae you can do `c(y1, y2, y3) ~ x + z` and the outcome will be a matrix with as many columns as names in `c()`. The `c()` is like `c()` in R, which stands for concatenate.
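Conceptually, the multivariate response is just a matrix with one column per outcome. A minimal pandas/numpy sketch of what a formula like `c(y1, y2, y3) ~ 1 + t` would produce, built by hand rather than with formulae itself (column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy data with three outcome columns and one predictor.
df = pd.DataFrame({
    "t": np.arange(5),
    "y1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y2": [2.0, 2.5, 3.0, 3.5, 4.0],
    "y3": [0.5, 1.0, 1.5, 2.0, 2.5],
})

# Response matrix: one column per name inside c(...)
Y = df[["y1", "y2", "y3"]].to_numpy()

# Design matrix for the shared RHS "1 + t"
X = np.column_stack([np.ones(len(df)), df["t"]])

print(Y.shape, X.shape)  # (5, 3) (5, 2)
```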

Extensions


I'd be very interested to hear whether there is any appetite for this kind of functionality, or if you think it doesn't add much.

charlesdtdb commented 1 week ago

Hello! This is great and could solve one of my measurement needs.

Recently, Facebook impressions dropped due to a budget change, leading to a simultaneous drop in traffic from Facebook, direct, shopping, and brand search. There’s little doubt that Facebook impressions are the influencing factor, but I’m still figuring out the best way to measure and prove causality.

Also, since spend and impressions are continuous with diminishing returns, I wanted to know if this model works with continuous data.

Thank you!

drbenvincent commented 1 week ago

Thanks, good to know this might be of use @charlesdtdb. Can I just check: by continuous data, do you mean that impressions are thought of as a graded/continuous treatment or causal driver of your other observations?

My initial thoughts are that that problem would warrant a novel model. I have some ideas.

Happy to talk this through to see if you'd like to create a custom model and submit a PR. Though if this is high priority/value, then it could be worth considering a consulting engagement.

KK8627 commented 1 week ago

Thanks for sharing @drbenvincent. This would be quite useful for analysing the impact of brand campaigns on a product portfolio by specifying a list of model formulas within the InterruptedTimeSeries function. Since each formula would represent a separate time series model for each product's sales, this would allow capturing the intervention's effect on all products simultaneously while accounting for potential correlations between their sales.

Building on a limitation that you already pointed out, the model's performance can be sensitive to the choice of priors when dealing with covariance. Inappropriate priors can introduce bias and distort the results in situations where the sales of one product influence another. The model formula would also differ across categories, e.g. for a bank vs. a CPG brand. Would love to explore this further, and will share any other concerns if I come across any.

drbenvincent commented 1 week ago

Hi @KK8627. What could be pretty cool is if you were able to either share a dataset OR code up a simple simulated dataset with numpy for example. If you were able to keep the example simple, but also capture the core statistical nature of the datasets you deal with, then that could form part of the new documentation. It would also mean that the example could be grounded in a realistic case, even though it's simulated. If so, feel free to drop in some numpy code here.

KK8627 commented 1 week ago

Certainly. Here you go. This would be different from my real-world data but might give an idea.

```python
import numpy as np
import pandas as pd

n_timesteps = 100

# Independent noise terms for each series
e1 = np.random.normal(0, 1, size=n_timesteps)
e2 = np.random.normal(0, 1, size=n_timesteps)
e3 = np.random.normal(0, 1, size=n_timesteps)

# y (sales of P3) depends on its own lag (scaled) plus noise
y = 0.5 * np.random.rand(n_timesteps)
lagged_y = np.pad(y, (1, 0), mode="edge")[:-1]
y = lagged_y + 0.8 * lagged_y + 0.4 * e1

# x1 (sales of P1) is driven by y; x2 (sales of P2) is independent of y
x1 = y + 0.3 * e2
x2 = 0.2 * np.random.rand(n_timesteps) + 0.7 * e3

seasonality = np.sin(2 * np.pi * np.linspace(0, 1, n_timesteps))

# Binary marketing campaign indicator
marketing_campaign = np.zeros(n_timesteps)
marketing_start = 2
marketing_end = 32
marketing_campaign[marketing_start:marketing_end] = 1  # keeping it simple for this example :)

data = np.vstack((x1, x2, y, seasonality, marketing_campaign)).T
column_names = ["P1", "P2", "P3", "Seasonality", "Marketing Campaign"]
data = pd.DataFrame(data, columns=column_names)
```

drbenvincent commented 1 week ago

Thanks @KK8627. I was thinking of something more like this, which focuses on the covariance structure between sales of products. Note: this just simulates the pre-intervention period and the post-intervention counterfactual; it doesn't yet add any causal impact of the intervention.

This example also doesn't have any other predictor variables which influence product sales, though a richer example could (e.g. trend or seasonality components).

```python
import numpy as np
import pandas as pd

# Means and covariance of the three correlated product-sales series
mu = [0, 1, 2]
cov = [[1, 0.5, 0.3],
       [0.5, 1, 0.4],
       [0.3, 0.4, 1]]
num_points = 52 * 2  # two years of weekly data

data = np.random.multivariate_normal(mu, cov, num_points)
df = pd.DataFrame(data, columns=["Product 1", "Product 2", "Product 3"])
```
[screenshots: plots of the simulated correlated product-sales data]
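For completeness, here is a hypothetical sketch of how a causal impact could be layered onto the post-intervention period of such simulated data (the per-product effect sizes are made up for illustration; none of this is CausalPy API):

```python
import numpy as np
import pandas as pd

# Simulate correlated product sales as before, then add a constant
# per-product lift after the intervention point.
rng = np.random.default_rng(1)
mu = [0, 1, 2]
cov = [[1, 0.5, 0.3],
       [0.5, 1, 0.4],
       [0.3, 0.4, 1]]
num_points = 52 * 2
treatment_idx = 52  # intervention after the first year

data = rng.multivariate_normal(mu, cov, num_points)
effect = np.array([0.8, 0.5, -0.2])  # illustrative causal impact per product
data[treatment_idx:] += effect       # applied to the post period only

df = pd.DataFrame(data, columns=["Product 1", "Product 2", "Product 3"])
print(df.shape)
```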

I would have guessed that the covariance structure would have been a more important feature.

I got a bit confused by your example: y, which you call the target variable, causally influences x1 but not x2, and both x1 and x2 are your multivariate outcomes (e.g. sales)? Just making sure I have a good understanding before diving in and implementing.

KK8627 commented 1 week ago

Thanks for the catch @drbenvincent. Apologies, yes, all variables were product sales (now edited in the column headers above). It seems that my brain is always looking to separate these. I also added seasonality and marketing variables as predictors to my original dummy dataset. For clarity, I was trying to create a time series example with Granger causality, where y (sales of P3) impacts x1 (sales of P1) but not x2. On further thought, I also think that it is unlikely that a product's sales would directly influence another product's sales within the same portfolio in the same time step (unless they're competing products). Since the intention is to showcase PyMC's ability to handle causal relationships, this might not be the most realistic scenario for our specific use case.

Your suggestion to focus on the covariance structure through the simulated data is a better fit for this use case. It would allow us to isolate the impact of covariance. Here's your example with the addition of seasonality and marketing effects. Looking forward to your thoughts.

```python
import numpy as np
import pandas as pd

num_points = 100

mu = [5, 3, 2]
cov = [[1, 0.5, 0.3],
       [0.5, 1, 0.4],
       [0.3, 0.4, 1]]

# Correlated product-sales series
data = np.random.multivariate_normal(mu, cov, num_points)

seasonality_effect = np.sin(np.pi * np.random.rand(num_points))  # values in [0, 1]

marketing_effect = np.random.choice([-0.2, 0.3], size=num_points)  # negative or positive impact

data = np.hstack((data, seasonality_effect.reshape(-1, 1), marketing_effect.reshape(-1, 1)))

df = pd.DataFrame(data, columns=["Product 1", "Product 2", "Product 3",
                                 "Seasonality Effect", "Marketing Effect"])
print(df)
```
KK8627 commented 1 week ago

@drbenvincent - I also wanted to explain what I am trying to achieve out of this (I should have probably started with this note..lol). What I am looking to analyse is the impact (beneficial or detrimental) of larger product purchases from a portfolio on smaller products within the same portfolio, hence I assigned the Granger causality to only one of the two sales variables in the original dummy dataset, though it could technically impact both. Here are some situational examples for context:

1. Subscription services, where the main service purchase impacts smaller products like movie rentals (might decrease), TV show purchases (decrease), or streaming devices (might increase or have no change).
2. Banking services, where the main product is the premium checking account and the smaller products influenced are debit card usage (might increase) and overdraft protection (no change).
3. CPG products, where the main product could be a high-end coffee machine and the smaller products influenced would include coffee beans (increase), flavored syrups, and creamers (increase).

I'm not sure if anyone can create an ideal dataset matching all the situations above, but I do believe that the methodology can be applied to product bundling solutions, offering complementary services, managing price differentials or offering product substitutes as a business solution coming out of the model results. Please feel free to disagree and I look forward to your thoughts. Thanks again!

drbenvincent commented 1 week ago

Thanks @KK8627, this is very useful context. Having worked on related kinds of things on client projects I have some thoughts.

Very happy to consider the latter, but I think that would be a different GitHub issue which can exclusively focus on that. Let me know which of the two options above sounds more in line with your particular goals.

charlesdtdb commented 1 week ago

> Thanks, good to know this might be of use @charlesdtdb. Can I just check, by continuous data, you mean that impressions are thought of as a graded/continuous treatment or causal driver of your other observations?
>
> My initial thoughts are that that problem would warrant a novel model. I have some ideas.
>
> Happy to either talk through to see if you'd like to create a custom model and submit a PR. Though if this is high priority/value, then it could be useful to consider a consulting engagement.

Hi @drbenvincent, thanks for your reply. I'd love to chat and try to build a custom model. I am not a Bayesian expert and might be short of the expertise needed to do the PR, but it could be a good way for me to learn.

Impressions are indeed being considered as a continuous treatment, meaning I am examining how different levels of impressions (rather than just the presence or absence) affect my outcome variables (website traffic through other channels).
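One common way to encode diminishing returns of a continuous treatment like impressions is a saturating transform of the treatment variable. A generic sketch under that assumption (a `log1p` transform; not CausalPy functionality, and all names and numbers are illustrative):

```python
import numpy as np

# Hypothetical sketch: a saturating (diminishing-returns) transform of a
# continuous treatment, used as a regressor for an outcome.
rng = np.random.default_rng(7)
impressions = rng.uniform(0, 10_000, size=200)
traffic = 50 + 12 * np.log1p(impressions) + rng.normal(0, 5, size=200)

# Fit traffic ~ 1 + log1p(impressions) by ordinary least squares
X = np.column_stack([np.ones_like(impressions), np.log1p(impressions)])
beta, *_ = np.linalg.lstsq(X, traffic, rcond=None)
print(np.round(beta, 1))  # roughly [50, 12] by construction
```

The same idea carries over to a Bayesian model: the transform makes each additional impression contribute less to the outcome than the one before it.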

KK8627 commented 1 week ago

Thanks @drbenvincent. It's most certainly a graded intervention (route 2). I think we can consider route 1 in the case of situations like limited-time offers, but for most situations I agree with the recommendation to take the main product sales as the primary target and then analyze reductions and lag in the other products. Certainly happy to move this to a new issue or as you see fit. This is a very helpful conversation and your inputs are greatly appreciated. Thanks again!