py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License

Falsification of given DAG: not working on simulated data? #1123

Closed sinya2 closed 6 months ago

sinya2 commented 6 months ago

Hi! I've been exploring the new graph falsification feature and tried it on some rather simple simulated data:

import numpy as np
import pandas as pd
import networkx as nx
import scipy.stats as sps
from dowhy.gcm.falsify import falsify_graph

causal_graph = """graph[directed 1 
                    node[id "X" label "X"]
                    node[id "Y" label "Y"]
                    node[id "Z" label "Z"]
                    node[id "T" label "T"]
                    edge[source "T" target "X"]
                    edge[source "X" target "Y"]
                    edge[source "Z" target "Y"]
                    edge[source "Z" target "X"]]"""
with open('sample.gml', 'w') as file:
    file.write(causal_graph)

n = 1000
z = np.array(sps.norm.rvs(100,10, size = n))
t = np.array(sps.norm.rvs(100,10, size = n))
x = 2*z + 8*t+ np.array(sps.norm.rvs(10,10, size = n))
y = 5*z + 7*x + np.array(sps.norm.rvs(10,10, size = n))

df = pd.DataFrame({'X':x,'Y':y, 'Z':z, 'T':t})

g_true = nx.read_gml(f"sample.gml")

result = falsify_graph(g_true, df, plot_histogram=True,  suggestions=True)
print(result)

But surprisingly, I get LMC violations with a p-value > 0.05 in 8 out of 10 runs.

bloebp commented 6 months ago

Hi,

Thanks for raising this interesting issue. You are right that, in theory, the graph should pass all LMC tests. It finds a violation for the Y node, and when I checked the underlying independence tests, it rejects the conditional independence between T and Y given X. The statistical power of the independence test is reaching its limit here: the noise on Y is relatively small compared to the contributions of Z and T once their coefficients are taken into account. Substituting X into Y's equation gives a 56 * T term, so the variance added by Y's own noise is tiny in comparison. If you increase the variance of the noise or decrease the coefficients, it should correctly find that the LMCs hold. For instance, change it to:

y = 5*z + 7*x + np.array(sps.norm.rvs(10,20, size = n))

(noise standard deviation from 10 to 20 — note that the second argument of sps.norm.rvs is the scale, i.e. the standard deviation, not the variance)
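To make the signal-to-noise argument concrete, here is the arithmetic behind the 56 * T remark, computed from the structural equations in the original post (pure arithmetic, no library calls):

```python
# Substituting X into Y's equation from the original simulation:
#   Y = 5*Z + 7*(2*Z + 8*T + Nx) + Ny = 19*Z + 56*T + 7*Nx + Ny,
# where Z, T ~ N(100, 10) and Nx, Ny ~ N(10, 10), i.e. std dev 10, variance 100.
var_z = 19**2 * 10**2   # 36,100
var_t = 56**2 * 10**2   # 313,600
var_nx = 7**2 * 10**2   # 4,900
var_ny = 10**2          # 100, with the original noise std dev of 10

total = var_z + var_t + var_nx + var_ny
noise_share = var_ny / total
print(f"Y's own noise accounts for {noise_share:.4%} of Var(Y)")  # ~0.03%
```

With Y's independent noise contributing well under 0.1% of its total variance, a finite-sample independence test has very little to work with.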

Alternatively, other independence tests may be better at capturing (in)dependencies when the variance is relatively small. Since you only have linear relationships, you could try partial correlations instead.
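A minimal sketch of such a partial-correlation test, written as a standalone function (the residualize-and-correlate approach; it is only appropriate for roughly linear-Gaussian data like this example — dowhy's `falsify_graph` also accepts custom tests via its independence-test arguments, so check the installed version's signature if you want to plug one in):

```python
import numpy as np
from scipy import stats

def partial_corr_test(x, y, z):
    """P-value for H0: x is independent of y given z, via partial correlation.

    Residualizes x and y on z with least squares, then applies a Pearson
    correlation test to the residuals. z is assumed 1-D here; stack more
    columns into `design` for multiple conditioning variables.
    """
    design = np.column_stack([np.ones_like(z), z])
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    r, p = stats.pearsonr(rx, ry)
    return p

# Chain T -> X -> Y: T should be independent of Y given X.
rng = np.random.default_rng(0)
n = 2000
t = rng.normal(100, 10, n)
x = 8 * t + rng.normal(10, 10, n)
y = 7 * x + rng.normal(10, 1, n)
print(partial_corr_test(t, y, x))  # expected to be large: no detectable dependence
```

For linear relationships this test keeps its power even when the noise on Y is small relative to the other terms, which is exactly the regime where a generic kernel test struggles.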

Let me know if this helps!

sinya2 commented 6 months ago

Thank you very much for the answer! Increasing the noise on Y really helped.

Do I understand correctly that for chains like T->X->Y I will not see conditional independence between T and Y given X if the noise contributed to Y through T is not on the same scale as Y's own independent noise?

I've tested a bit, and only for a pure chain does it successfully pass the independence test, even with a lower Y noise (std 10 -> 1).


But it does not pass for my initial model, even with a larger Y noise (std 10 -> 100).

bloebp commented 6 months ago

Generally, in that chain, T should always be independent of Y given X as long as the relationships are non-deterministic; it is only a question of whether the independence test is able to capture this. The issue with the signal-to-noise ratio is the following: if the relationship were deterministic, e.g. in the obvious case of the identity X := T, then clearly T and Y are not independent given X = T. Adding noise makes the relationship non-deterministic and helps, but if the noise is too small, the independence test may still fail to capture the independence.

sinya2 commented 6 months ago

I've realized that for the initial model, T should not be independent of Y given X after all: conditioning on X, which is a collider on the path T -> X <- Z -> Y, opens that path through Z.
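This graphical claim can be checked mechanically with networkx's d-separation helper (named nx.d_separated in older releases and nx.is_d_separator from networkx 3.3 on; the sketch below handles both):

```python
import networkx as nx

# The original graph: T -> X, Z -> X, X -> Y, Z -> Y (X is a collider for T and Z).
g_full = nx.DiGraph([("T", "X"), ("Z", "X"), ("X", "Y"), ("Z", "Y")])
# A pure chain for comparison: T -> X -> Y.
g_chain = nx.DiGraph([("T", "X"), ("X", "Y")])

# networkx renamed d_separated to is_d_separator in 3.3; support both names.
d_sep = getattr(nx, "is_d_separator", None) or nx.d_separated

print(d_sep(g_chain, {"T"}, {"Y"}, {"X"}))  # True: X blocks the chain
print(d_sep(g_full, {"T"}, {"Y"}, {"X"}))   # False: conditioning on the collider
                                            # X opens T -> X <- Z -> Y
```

So the remaining LMC "violation" on the initial model is not a test failure at all: T and Y are genuinely dependent given X in that graph.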