Data correlation matrix is singular

asha24choudhary commented 11 months ago

I might be missing some theoretical concept, but I want to clear it up now.

I have a dataset for the fork scenario. My data is generated as follows:

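# c) Fork
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from dowhy import CausalModel

N_SAMPLES = 1000  # assumed value; the original post does not define it

# Create the graph describing the causal structure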
graph = """graph[directed 1 node[id "W" label "W"]
                    node[id "X" label "X"]
                    node[id "Y" label "Y"]
                    edge[source "X" target "Y"]
                    edge[source "X" target "W"]]""".replace('\n', '')

# Generate the data: W and Y are exact (noise-free) linear functions of X
X = np.random.randn(N_SAMPLES)
W = 0.5*X
Y = 0.8*X

# Data to df
df = pd.DataFrame(np.vstack([X, W, Y]).T, columns=['X', 'W', 'Y'])
print(df.head(10))
# Create a model
model = CausalModel(
    data=df,
    treatment=['X'],
    outcome=['Y'],
#     common_causes=['Z'],
    graph=graph
)
plt.figure(figsize=(5,5))
model.view_model()
plt.show()

Clearly, the data matrix has rank 1, because W and Y are exact multiples of X, as you can see in the figure below.
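The rank claim can be checked numerically. The following is a minimal sketch in plain NumPy that regenerates data of the same form; the seed, the sample size, and the tolerance `1e-8` are assumptions chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed for reproducibility
n_samples = 1000                # assumed sample size
X = rng.standard_normal(n_samples)
data = np.column_stack([X, 0.5 * X, 0.8 * X])  # columns: X, W, Y

# Rank of the raw data matrix; the explicit tolerance absorbs rounding noise.
rank = np.linalg.matrix_rank(data, tol=1e-8)

# Pairwise correlations between the three columns.
corr = np.corrcoef(data, rowvar=False)
corr_rank = np.linalg.matrix_rank(corr, tol=1e-8)

print("data rank:", rank)
print("correlation matrix rank:", corr_rank)
print(np.round(corr, 6))
```

Every entry of the correlation matrix is 1 up to rounding, so the 3x3 matrix has only one independent row and cannot be inverted.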

When I perform causal discovery using PC, I get 'ValueError: Data correlation matrix is singular. Cannot run fisherz test. Please check your data.'
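For context, a Fisher-z test is built on a partial correlation, which can be read off the inverse of the relevant correlation submatrix. The sketch below is a plain NumPy illustration of that idea and not causal-learn's actual implementation; the helper name `partial_corr_via_inverse` and the two example matrices are assumptions made here.

```python
import numpy as np

def partial_corr_via_inverse(corr, i, j, cond=()):
    """Partial correlation of variables i and j given the variables in cond.

    Hypothetical helper for illustration: it inverts the correlation
    submatrix over (i, j, *cond) and normalises the off-diagonal entry.
    """
    idx = [i, j, *cond]
    sub = corr[np.ix_(idx, idx)]
    precision = np.linalg.inv(sub)  # raises LinAlgError if sub is exactly singular
    return -precision[0, 1] / np.sqrt(precision[0, 0] * precision[1, 1])

# A well-conditioned example: all pairwise correlations equal to 0.5.
ok_corr = np.array([[1.0, 0.5, 0.5],
                    [0.5, 1.0, 0.5],
                    [0.5, 0.5, 1.0]])
r_ok = partial_corr_via_inverse(ok_corr, 0, 1, cond=(2,))

# The idealised fork data: every pairwise correlation is exactly 1.
singular_corr = np.ones((3, 3))
try:
    partial_corr_via_inverse(singular_corr, 1, 2, cond=(0,))
    inversion_failed = False
except np.linalg.LinAlgError:
    inversion_failed = True

print("partial correlation (well-conditioned):", r_ok)
print("inversion failed on singular matrix:", inversion_failed)
```

When the submatrix is singular there is no inverse to normalise, which is consistent with the test refusing to run on this data.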

Below you can find the code which I'm using to perform causal discovery using PC.

from causallearn.search.ConstraintBased.PC import pc
from causallearn.utils.cit import fisherz
from causallearn.utils.GraphUtils import GraphUtils

# default parameters
cg = pc(df.to_numpy(), 0.05, fisherz)

# visualization using pydot
cg.draw_pydot_graph(labels=df.columns)

# or save the graph
pyd = GraphUtils.to_pydot(cg.G, labels=df.columns)
pyd.write_png('pc_fork.png')

I need help understanding this. I assume the error occurs because the variables are perfectly correlated, which makes the correlation matrix singular. How can I resolve it without adding random noise to W and Y? Is causal discovery not possible in my case?

asha24choudhary commented 11 months ago

Do you think it would be a good idea to compute the pseudo-inverse when inverting the sub_corr_matrix raises an error?

WilliamsToTo commented 11 months ago

I have the same issue when I use causallearn.search.ScoreBased.GES. I guess it is caused by the input data, but I don't know what requirements the data should meet. It would be good if the developers could list the requirements for input data.
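In the absence of a documented list, one check that can be run before calling any algorithm is to look for pairs of columns that are (almost) perfectly correlated. This is a rough sketch, not an official requirement of causal-learn; the helper name `find_collinear_pairs`, the threshold, and the toy data are assumptions made here.

```python
import numpy as np

def find_collinear_pairs(data, names, threshold=0.999999):
    """Return the column-name pairs whose absolute correlation reaches threshold.

    Hypothetical pre-check: such pairs make the correlation matrix
    singular or nearly singular.
    """
    corr = np.corrcoef(data, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j]))
    return pairs

rng = np.random.default_rng(0)     # fixed seed, assumed for reproducibility
X = rng.standard_normal(500)
W = 0.5 * X                        # exact multiple of X
Z = rng.standard_normal(500)       # independent of X and W
pairs = find_collinear_pairs(np.column_stack([X, W, Z]), ["X", "W", "Z"])
print(pairs)
```

Pairwise checks only catch two-variable dependencies; a column that is an exact combination of several others would need a rank or condition-number check on the full correlation matrix.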

kunwuz commented 11 months ago

Yes, this is due to a violation of the assumptions on the data-generating process, e.g., a violation of faithfulness. I don't know whether any strategy exists to detect this from an observed dataset. The pseudo-inverse could be a good solution in practice, but we need to investigate further to see whether it would affect the asymptotic guarantees.
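To make the pseudo-inverse idea concrete, here is a sketch that swaps the matrix inverse for `np.linalg.pinv` when computing a partial correlation. It is an illustration only and not a patch to causal-learn; the helper name `partial_corr_via_pinv`, the cutoff `rcond=1e-10`, and the example matrices are assumptions made here.

```python
import numpy as np

def partial_corr_via_pinv(corr, i, j, cond=(), rcond=1e-10):
    """Partial correlation of i and j given cond, using a pseudo-inverse.

    Hypothetical helper: np.linalg.pinv never raises on a singular
    submatrix, so a number is always returned.
    """
    idx = [i, j, *cond]
    sub = corr[np.ix_(idx, idx)]
    precision = np.linalg.pinv(sub, rcond=rcond)
    return -precision[0, 1] / np.sqrt(abs(precision[0, 0] * precision[1, 1]))

# On an invertible matrix the pseudo-inverse equals the ordinary inverse.
ok_corr = np.array([[1.0, 0.5, 0.5],
                    [0.5, 1.0, 0.5],
                    [0.5, 0.5, 1.0]])
r_regular = partial_corr_via_pinv(ok_corr, 0, 1, cond=(2,))

# On the exactly singular fork matrix no exception is raised.
singular_corr = np.ones((3, 3))
r_singular = partial_corr_via_pinv(singular_corr, 1, 2, cond=(0,))

print("well-conditioned:", r_regular)
print("singular:", r_singular)
```

On the singular example the returned value has magnitude 1, where the Fisher z-transform diverges, so the result should not be read as a well-defined partial correlation. The exception goes away, but the statistical question raised above remains open.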

kunwuz commented 11 months ago

Perhaps adding some small random noise could help?
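As a sketch of this suggestion, the snippet below adds independent Gaussian noise to W and Y and checks that the correlation matrix becomes full rank. The noise scale `0.1`, the seed, and the sample size are assumptions chosen for illustration; a suitable scale depends on the data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed for reproducibility
n_samples = 1000                # assumed sample size
noise_scale = 0.1               # assumed noise standard deviation

X = rng.standard_normal(n_samples)
W = 0.5 * X + noise_scale * rng.standard_normal(n_samples)
Y = 0.8 * X + noise_scale * rng.standard_normal(n_samples)
noisy = np.column_stack([X, W, Y])

corr = np.corrcoef(noisy, rowvar=False)
rank = np.linalg.matrix_rank(corr)
condition_number = np.linalg.cond(corr)

print("correlation matrix rank:", rank)
print("condition number:", condition_number)
```

With noise, W and Y are no longer deterministic functions of X, so the submatrices used by the test can be inverted. The smaller the noise, the larger the condition number and the less stable the inversion.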

priamai commented 11 months ago

Yes, and can you check two things: a) the distinct count per column, and b) the count of identical (duplicated) rows?

From what I have learned, repeated data does create a singular matrix. I'm also interested to learn more!
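Both checks can be sketched in a few lines of NumPy, applied for instance to `df.to_numpy()`. The helper name `distinct_counts` and the toy array are assumptions made here; with pandas, `df.nunique()` and `df.duplicated().sum()` give the same information.

```python
import numpy as np

def distinct_counts(data):
    """Return (distinct values per column, number of duplicated rows).

    Hypothetical helper for illustration.
    """
    per_column = [int(np.unique(data[:, k]).size) for k in range(data.shape[1])]
    unique_rows = np.unique(data, axis=0).shape[0]
    return per_column, data.shape[0] - unique_rows

# Toy data: the first two rows are identical.
toy = np.array([[1.0, 2.0],
                [1.0, 2.0],
                [3.0, 2.0],
                [4.0, 5.0]])
per_column, n_duplicate_rows = distinct_counts(toy)
print("distinct values per column:", per_column)
print("duplicated rows:", n_duplicate_rows)
```

A column with a single distinct value has zero variance, so its correlations are undefined, which is another way to end up with an unusable correlation matrix.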

py-why / causal-learn

Data correlation matrix is singular #155