microsoft / causica

MIT License

How to make nonlinear predictions on categorical variables in causal inference given a causal graph? #88

Closed by wenyaoliu 9 months ago

wenyaoliu commented 11 months ago

I have been learning causal inference and causal discovery recently, and I have struggled with this question for a long time.

From my understanding of the literature, causal inference is quite different from traditional machine learning. In traditional machine learning, once a model is trained, it directly predicts the value of Y for a given set of inputs X.

In causal inference, however, the model answers a different kind of question: for example, if X1 changes from 1 to 2, it returns the causal effect of that change on Y.

So how can I answer the prediction question using causal inference?

Here are some simulated data using this graph:

import networkx as nx
import matplotlib.pyplot as plt
# Create a directed graph
G = nx.DiGraph()
# Add nodes X, Y, and Z
G.add_nodes_from(['X', 'Y', 'Z'])
# Add edges representing causal relationships
G.add_edge('X', 'Y')
G.add_edge('Z', 'Y')
# Draw the graph
pos = nx.spring_layout(G)
nx.draw_networkx(G, pos, with_labels=True, node_color='lightblue', node_size=500, font_size=12, edge_color='gray')
plt.title('Causal Graph')
plt.show()

[image: the causal graph, with edges X → Y and Z → Y]

# Create the nonlinear relationships:
import numpy as np
import pandas as pd

# Generate X values
X = np.linspace(0, 10, 100)

# Generate Z values
Z = np.linspace(10, 20, len(X))

# Generate Y values using a non-linear relationship with X
Y = np.sin(X) + np.cos(Z) + np.random.normal(0, 0.1, len(X))

# Combine X, Z, and Y into one pandas DataFrame
df = pd.DataFrame({'X': X, 'Z': Z, 'Y': Y})

# Print the DataFrame
print(df)

           X         Z         Y
0    0.00000  10.00000 -0.781419
1    0.10101  10.10101 -0.691126
2    0.20202  10.20202 -0.603684
3    0.30303  10.30303 -0.418206
4    0.40404  10.40404 -0.087543
..       ...       ...       ...
95   9.59596  19.59596  0.748085
96   9.69697  19.69697  0.455545
97   9.79798  19.79798  0.365235
98   9.89899  19.89899  0.023566
99  10.00000  20.00000 -0.193462

[100 rows x 3 columns]

# plot the data
import matplotlib.pyplot as plt

# Plot X, Y, and Z
plt.plot(df['X'], label='X')
plt.plot(df['Y'], label='Y')
plt.plot(df['Z'], label='Z')

# Add labels and legend
plt.xlabel('Index')
plt.ylabel('Value')
plt.legend()

# Show the plot
plt.show()

[image: line plot of X, Y, and Z against the row index]

The problem is:

How can I predict Y when X = 10 and Z = 20, assuming I know only the causal graph and not the underlying causal functions?

I have tried using Microsoft Causica to identify the causal graph, and I have also tried its causal inference features, but neither of these addresses a prediction problem.

WenboGong commented 9 months ago

Hi, thanks for your question. In general there are many possible ways to do this. One excellent reference is the book Elements of Causal Inference (https://mitpress.mit.edu/9780262037310/elements-of-causal-inference/). To answer your question: given the graph, you can, for example, use do-calculus to simplify $p(Y|do(X))$ into conditional probabilities and estimate those from the data.
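As a sketch of that reduction for the graph in the question (X → Y ← Z), and assuming there are no unobserved confounders (an assumption that goes beyond what the graph itself states): X and Z have no parents, so no backdoor path enters either of them, and the interventional distributions reduce to ordinary conditionals. For continuous Z the sum becomes an integral.

$$
\begin{aligned}
p(y \mid do(X = x), do(Z = z)) &= p(y \mid X = x, Z = z), \\
p(y \mid do(X = x)) &= \sum_{z} p(y \mid X = x, Z = z)\, p(z) = p(y \mid X = x).
\end{aligned}
$$

Under these assumptions, the quantity asked for, $E[Y \mid do(X = 10), do(Z = 20)]$, equals the regression $E[Y \mid X = 10, Z = 20]$.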

Or you can fit a structural equation model (SEM) to your generated data, then use the mutilated graph (i.e. the graph with all incoming edges to the intervened variable removed) together with the corresponding SEM to estimate the interventional distribution $p(Y|do(X))$.
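To make the graph mutilation concrete, here is a minimal sketch in plain Python. The `mutilate` helper and the parent-list representation are hypothetical, chosen for illustration, and are not part of Causica. It shows that intervening on X and Z leaves this particular graph unchanged, because neither node has incoming edges to cut.

```python
def mutilate(parents, intervened):
    """Return a copy of the graph with all incoming edges to intervened nodes removed.

    `parents` maps each node to the list of its parent nodes (hypothetical format).
    """
    return {
        node: ([] if node in intervened else list(node_parents))
        for node, node_parents in parents.items()
    }


# Parent lists for the graph in the question: X -> Y <- Z
parents = {"X": [], "Z": [], "Y": ["X", "Z"]}

# Intervening on X and Z removes nothing, since both are root nodes.
mutilated_xz = mutilate(parents, intervened={"X", "Z"})
print(mutilated_xz == parents)

# By contrast, intervening on Y would cut both of its incoming edges.
mutilated_y = mutilate(parents, intervened={"Y"})
print(mutilated_y["Y"])
```

Since the mutilated graph equals the original one here, the structural equation for Y fitted on observational data can be reused directly under do(X), do(Z).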

For the latter approach, you can fit a DECI model with the prior graph set to the true graph, and then estimate the interventional distribution. Once you work with the mutilated graph, this is the same as a prediction problem.
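As a rough illustration of why this reduces to prediction, the sketch below fits the structural equation Y = f(X, Z) + noise for the question's simulated data and evaluates it at X = 10, Z = 20. It does not use DECI or any Causica API: a small NumPy kernel ridge regression is swapped in as a stand-in nonlinear regressor, and the helper names, `length_scale`, `ridge`, and the fixed seed are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an assumption for reproducibility

# Same simulated data as in the question.
X = np.linspace(0, 10, 100)
Z = np.linspace(10, 20, len(X))
Y = np.sin(X) + np.cos(Z) + rng.normal(0, 0.1, len(X))


def rbf_kernel(A, B, length_scale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)


def fit_structural_equation(parent_values, target, length_scale=1.0, ridge=1e-2):
    """Fit target = f(parents) + noise with kernel ridge regression (stand-in for DECI)."""
    K = rbf_kernel(parent_values, parent_values, length_scale)
    alpha = np.linalg.solve(K + ridge * np.eye(len(target)), target)

    def predict(new_parent_values):
        new_parent_values = np.atleast_2d(new_parent_values)
        return rbf_kernel(new_parent_values, parent_values, length_scale) @ alpha

    return predict


# The graph says Y's parents are X and Z, so regress Y on exactly those two.
parent_values = np.column_stack([X, Z])
f_hat = fit_structural_equation(parent_values, Y)

# Prediction under do(X = 10), do(Z = 20): plug the values into the fitted equation.
y_pred = f_hat(np.array([10.0, 20.0]))[0]
y_true = np.sin(10.0) + np.cos(20.0)  # noise-free ground truth of the simulation
print(f"predicted: {y_pred:.3f}, noise-free truth: {y_true:.3f}")
```

One caveat about the simulated data: Z equals X + 10 in every row, so the two parents are perfectly collinear and no regressor can separate their individual effects. The fitted function is only trustworthy along that line, which does include the query point (10, 20), although that point sits at the edge of the observed range.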

For estimating interventional distributions with Causica, please see the ATE estimation in https://github.com/microsoft/causica/blob/main/examples/multi_investment_sales_attribution.ipynb