Possible regression - Githubissues

paulds8 commented 4 years ago

Around two months ago I was experimenting with DoWhy. I installed from conda and set up a simple experiment with some synthetic data.

At that point in time, the estimand expression produced this: d
───────────(Expectation(Spend|Month,Transactions,Primary Store)) d[Treatment] Estimand assumption 1, Unconfoundedness: If U→{Treatment} and U→Spend then P(Spend|Treatment,Month,Transactions,Primary Store,U) = P(Spend|Treatment,Month,Transactions,Primary Store)

I recently updated the package, and now when I run the exact same code I see the following: d
────────────(Expectation(Spend)) d[Treatment]
Estimand assumption 1, Unconfoundedness: If U→{Treatment} and U→Spend then P(Spend|Treatment,,U) = P(Spend|Treatment,)

I've tested this code in your binder setup and it's producing the same as above, so I think we can rule out an environment issue on my end.

Has something changed with regards to how I need to set the DAG up perhaps? This seems to suggest the confounders are being disregarded. Is this really the case? Perhaps there's something I'm not understanding here.

Here's the graph it produces:

I've included the necessary code to reproduce the issue below.


import pandas as pd
import numpy as np
import pydot
import logging
from io import BytesIO

import matplotlib.pyplot as plt
import matplotlib.image as mpimg

import warnings
warnings.filterwarnings("ignore")

month_cols = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

def create_spend_data():

    df = pd.DataFrame([[np.random.randint(10,10000) for x in np.arange(12)] for i in np.arange(10000)],
                columns=month_cols)

    df['Treatment'] = [np.random.choice([True, False, False, False, False, False, False, False, False, False]) for x in np.arange(10000)]
    df['Primary Store'] = [np.random.randint(0,100) for x in np.arange(10000)]

    #make Treatment members on average spend ~5% more (some more, some less... but on average 5% more)
    df.loc[df['Treatment']==True, month_cols]  = df[month_cols] * (1 + np.random.random(2).sum()/20)

    df.reset_index(inplace=True)
    df = df.rename({'index': 'Customer'}, axis=1)

    df = df.melt(id_vars=['Customer','Treatment', 'Primary Store'],
           value_vars=month_cols,
           var_name='Month', 
           value_name='Spend')

    return df

def create_transactions_data():
    df = pd.DataFrame([[np.random.randint(1,10) for x in np.arange(12)] for i in np.arange(10000)],
                columns=month_cols)

    df.reset_index(inplace=True)
    df = df.rename({'index': 'Customer'}, axis=1)

    df = df.melt(id_vars=['Customer'],
           value_vars=month_cols,
           var_name='Month', 
           value_name='Transactions')

    return df

spend = create_spend_data()
transactions = create_transactions_data()
df = spend.merge(transactions, on=['Customer', 'Month'])

graph = """digraph {
"Treatment" [label="Treatment" color="darkgreen"];
"Month" [label="Month"];
"Spend" [label="Spend" color="blue"];
"Transactions" [label="Transactions"];
"Unobserved Confounders" [label="Unobserved Confounders" color="grey"];
"Treatment" -> "Transactions";
"Transactions" -> "Spend";
"Month" -> "Transactions";
"Month" -> "Spend";
"Treatment" -> "Spend";
"Treatment" -> "Primary Store";
"Month" -> "Primary Store";
"Primary Store" -> "Transactions";
"Primary Store" -> "Spend";
"Unobserved Confounders" -> "Primary Store";
"Unobserved Confounders" -> "Treatment";
"Unobserved Confounders" -> "Transactions";
"Unobserved Confounders" -> "Spend";
}"""

pydot_graph = pydot.graph_from_dot_data(graph)

# render pydot by calling dot, no file saved to disk
png_str = pydot_graph[0].create_png(prog='dot')

# treat the dot output string as an image file
sio = BytesIO()
sio.write(png_str)
sio.seek(0)
img = mpimg.imread(sio)

# plot the image
fig = plt.figure(dpi=300, figsize=(2,2))
imgplot = plt.imshow(img, aspect='equal', interpolation='bilinear')
plt.axis('off')
plt.show(block=False)

pydot_graph[0].write_dot('graph.dot')

model = CausalModel(
    data=df,
    treatment='Treatment',
    outcome='Spend',
    graph= 'graph.dot',
    logging_level = logging.INFO
)

# Identify causal effect and return target estimands
identified_estimand = model.identify_effect()
print(identified_estimand)```

paulds8 commented 4 years ago

As a follow up; for me to get to the original formulation where the confounders are all considered I need to change the way the graph is set up to this. The previous DAG conceptually makes a lot more sense to me though.

graph = """digraph { "Unobserved Confounders" [label="Unobserved Confounders" color="grey"]; "Treatment" [label="Treatment" color="darkgreen"]; "Month" [label="Month"]; "Spend" [label="Spend" color="blue"]; "Transactions" [label="Transactions"]; "Transactions" -> "Treatment"; "Month" -> "Treatment"; "Primary Store" -> "Treatment"; "Transactions" -> "Spend"; "Month" -> "Spend"; "Primary Store" -> "Spend"; "Treatment" -> "Spend"; "Unobserved Confounders" -> "Primary Store"; "Unobserved Confounders" -> "Transactions"; "Unobserved Confounders" -> "Treatment"; "Unobserved Confounders" -> "Spend"; }"""

amit-sharma commented 4 years ago

@paulds8 this is an interesting example. In your first graph, the month variable need not be conditioned on since it lies on a back-door path that has a collider, so it is not actually confounding the estimate. Primary Store and Transactions do lead to a back-door path between treatment and outcome, so they should be conditioned on. But at the same time, they are caused by treatment and therefore are mediators, and should not be conditioned on, assuming that the goal is to find the total effect of treatment on outcome. This is what creates confusion and the library decides to not condition on them. Here's a reference to see the backdoor criterion rules: http://www.stat.cmu.edu/~cshalizi/uADA/16/lectures/23.pdf (or you can consult Pearl's Causality book).

Your second graph is cleaner and establishes a separate back-door path for each of the confounders. Assuming that the second graph also captures the relationships between your variables, I suggest to use the second graph.

paulds8 commented 4 years ago

@amit-sharma, thank you for the resource; I'll give it a read! It should definitely help with my understanding of this.

Would you mind if we unpack this particular case further? I'd really appreciate your views on it.

In this example you can think of the treatment as a customer actively using some kind of rewards/loyalty program that they can subscribe to (there's a monthly cost). Not all customers do so. Customers that do, get discounts in return which in the grander scheme of things potentially leads to more total spend. We want to measure the ATT of this program on total monthly spend over a full year.

The thinking with having the month variable is it allows us to deal with seasonal variations in shopping patterns; e.g. differences around Easter & Christmas vs the rest of the year. This acts on the control and treatment groups equally.

Using the loyalty program potentially has an effect on where people choose to shop and can impact the number of transactions for a given month; but these variables obviously also exist for shoppers that aren't on the program so we'd like to make sure the control/target matching accounts for this as we're expecting differences in total spend as a result of using the program. Where you shop and the number of transactions in the month doesn't necessarily impact whether or not you decide to join the program.

Given this explanation, does the second graph still seem reasonable to proceed with?

amit-sharma commented 4 years ago

@paulds8 sorry I missed your reply somehow. Thanks for the details---I understand the problem better. I am not sure if either of the graphs are accurate here.

Time is the most crucial variable in such a problem. Suppose the customer's signup month is i. The primary store and num_transactions in the months before i are causes of the treatment (signup), and the primary store and num_transactions in the months after i are outcomes of the treatment. So you'll need to split up your data: the easiest is to consider users who signed up in each month separately. Then compute the pre- and post- transaction activity for that month, include the pre-month activity as the confounders and compute the causal effect. Then average over the causal effects over all months (you may also find different effects of signing up in November versus May, for example)

So the correct way is to condition on the activity before the treatment (signup) as confounders. And then ignore the activity after---since it will be a mediator on the path from treatment to final outcome.

To help describe the concept, I've shared a Jupyter notebook with a similar example: https://github.com/microsoft/dowhy/blob/master/docs/source/example_notebooks/dowhy_example_effect_of_memberrewards_program.ipynb

Does this help? (When trying it, I recommend pulling the latest dowhy from Github master branch. I fixed a few bugs with the backdoor criterion.)

amit-sharma commented 4 years ago

@paulds8 checking if the reply was useful to you. Let me know if you have any more questions for this issue.

paulds8 commented 4 years ago

Apologies, I have been on leave. Thank you. This is very useful! I need to make some time to circle back to this particular case in the coming days. I'll re-open if there are any questions so we can continue a coherent conversation.

py-why / dowhy

Possible regression #166