py-why / causal-learn

Causal Discovery in Python. It also includes (conditional) independence tests and score functions.
https://causal-learn.readthedocs.io/en/latest/
MIT License
1.04k stars, 174 forks

KCI isnan function not supported #139

Closed priamai closed 8 months ago

priamai commented 9 months ago

Hi there, this is my dataset. I am showing just a small sample of it (there are no NaNs):

array([[12066.0, 333.0, 5.0, 58.97, True, 823.87, 32.0, 25.7459375, True,
        1258.0, 3279.0, True, False],
       [0.0, 0.0, 0.0, 0.0, True, 2340.1, 82.0, 28.53780488, True,
        2500.0, 6794.0, True, False],
       [17581.0, 308.0, 10.0, 90.79314738, True, 872.43, 9.0,
        96.93666667, True, 0.0, 0.0, True, False],
       [967.0, 33.0, 18.0, 9.17, True, 1948.66, 78.0, 24.98282051, True,
        2258.0, 7667.0, True, True],
       [61312.0, 866.0, 21.0, 226.31438443, True, 0.0, 0.0, 0.0, True,
        892.0, 950.0, True, False],
       [40884.0, 679.0, 11.0, 219.87447185, True, 515.49, 6.0, 85.915,
        True, 1177.0, 1282.0, True, False],
       [63614.0, 826.0, 22.0, 260.15397368, True, 0.0, 0.0, 0.0, True,
        0.0, 0.0, True, False],
       [4226.0, 162.0, 7.0, 52.58, True, 172.7, 5.0, 34.54, True, 712.0,
        1552.0, True, False],
       [0.0, 0.0, 0.0, 0.0, True, 1389.53, 43.0, 32.31465116, True,
        1010.0, 5022.0, True, False],
       [41981.0, 1258.0, 6.0, 315.74796842, True, 514.76, 10.0, 51.476,
        True, 1084.0, 1176.0, True, False]], dtype=object)

There are no NaNs (per-column NaN counts):

IMPRESSIONS     0
CLICKS          0
CONVERSIONS     0
AD_SPEND_USD    0
IS_MARKETING    0
REVENUE_USD     0
NUM_ORDERS      0
AOV_USD         0
IS_REVENUE      0
VISITORS        0
SESSIONS        0
IS_TRAFFIC      0
IS_CAMPAIGN     0
dtype: int64

But when I run:

from causallearn.search.ConstraintBased.PC import pc
from causallearn.utils.cit import kci
dataset= X.to_numpy()
sub_cols = X.columns
cg = pc(dataset, alpha = 0.05, indep_test='kci')

Error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[219], line 5
      3 dataset= X.to_numpy()
      4 sub_cols = X.columns
----> 5 cg = pc(dataset, alpha = 0.05, indep_test='kci')

File /opt/conda/lib/python3.10/site-packages/causallearn/search/ConstraintBased/PC.py:46, in pc(data, alpha, indep_test, stable, uc_rule, uc_priority, mvpc, correction_name, background_knowledge, verbose, show_progress, node_names, **kwargs)
     41     return mvpc_alg(data=data, node_names=node_names, alpha=alpha, indep_test=indep_test, correction_name=correction_name, stable=stable,
     42                     uc_rule=uc_rule, uc_priority=uc_priority, background_knowledge=background_knowledge,
     43                     verbose=verbose,
     44                     show_progress=show_progress, **kwargs)
     45 else:
---> 46     return pc_alg(data=data, node_names=node_names, alpha=alpha, indep_test=indep_test, stable=stable, uc_rule=uc_rule,
     47                   uc_priority=uc_priority, background_knowledge=background_knowledge, verbose=verbose,
     48                   show_progress=show_progress, **kwargs)

File /opt/conda/lib/python3.10/site-packages/causallearn/search/ConstraintBased/PC.py:103, in pc_alg(data, node_names, alpha, indep_test, stable, uc_rule, uc_priority, background_knowledge, verbose, show_progress, **kwargs)
     64 """
     65 Perform Peter-Clark (PC) algorithm for causal discovery
     66 
   (...)
     99 
    100 """
    102 start = time.time()
--> 103 indep_test = CIT(data, indep_test, **kwargs)
    104 cg_1 = SkeletonDiscovery.skeleton_discovery(data, alpha, indep_test, stable,
    105                                             background_knowledge=background_knowledge, verbose=verbose,
    106                                             show_progress=show_progress, node_names=node_names)
    108 if background_knowledge is not None:

File /opt/conda/lib/python3.10/site-packages/causallearn/utils/cit.py:34, in CIT(data, method, **kwargs)
     32     return FisherZ(data, **kwargs)
     33 elif method == kci:
---> 34     return KCI(data, **kwargs)
     35 elif method in [chisq, gsq]:
     36     return Chisq_or_Gsq(data, method_name=method, **kwargs)

File /opt/conda/lib/python3.10/site-packages/causallearn/utils/cit.py:183, in KCI.__init__(self, data, **kwargs)
    178 kci_ci_kwargs = {k: v for k, v in kwargs.items() if k in
    179                  ['kernelX', 'kernelY', 'kernelZ', 'null_ss', 'approx', 'use_gp', 'est_width', 'polyd',
    180                   'kwidthx', 'kwidthy', 'kwidthz']}
    181 self.check_cache_method_consistent(
    182     'kci', hashlib.md5(json.dumps(kci_ci_kwargs, sort_keys=True).encode('utf-8')).hexdigest())
--> 183 self.assert_input_data_is_valid()
    184 self.kci_ui = KCI_UInd(**kci_ui_kwargs)
    185 self.kci_ci = KCI_CInd(**kci_ci_kwargs)

File /opt/conda/lib/python3.10/site-packages/causallearn/utils/cit.py:81, in CIT_Base.assert_input_data_is_valid(self, allow_nan, allow_inf)
     80 def assert_input_data_is_valid(self, allow_nan=False, allow_inf=False):
---> 81     assert allow_nan or not np.isnan(self.data).any(), "Input data contains NaN. Please check."
     82     assert allow_inf or not np.isinf(self.data).any(), "Input data contains Inf. Please check."

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

MarkDana commented 9 months ago

Hi @priamai, your data's dtype is object. Could you please convert it to a numeric dtype, e.g., data = data.astype(float), and try again?
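
For readers hitting the same error, here is a minimal sketch of the failure and the suggested fix. The two-row array is a made-up sample in the same shape as the data above (floats mixed with booleans), not the reporter's actual dataset.

```python
import numpy as np

# Hypothetical sample: floats and booleans stored together force dtype=object.
mixed = np.array([[12066.0, 333.0, True, False],
                  [0.0, 0.0, True, True]], dtype=object)

try:
    np.isnan(mixed)
    isnan_failed = False
except TypeError:
    # np.isnan is not defined for object arrays, hence the TypeError above.
    isnan_failed = True

# Casting to float makes the array numeric; True becomes 1.0 and False 0.0.
numeric = mixed.astype(float)
has_nan = bool(np.isnan(numeric).any())

print(isnan_failed, numeric.dtype, has_nan)
```

After the cast, the NaN check that causal-learn runs on the input can be evaluated normally.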

priamai commented 9 months ago

It is an array of floats and booleans, as you can see from my code. If I also convert the booleans to float, then I think they will no longer be treated as categorical.

MarkDana commented 9 months ago

Generally, mixed-type data is not well supported yet in causal-learn. You may try the Tetrad Java or py-Tetrad implementations instead. See https://github.com/py-why/causal-learn/issues/31 and https://github.com/py-why/causal-learn/issues/43.

In your specific case, since the boolean columns have only two categories, converting them to floats (0, 1) will not be an issue, and all the CI estimation results will stay the same.

Either way, setting your whole array's data type to object is not a good idea here. You may try using ints and floats in different columns.

priamai commented 9 months ago

Hello Mark, oh, that is a very relevant discussion. Thanks for the hard work, I will start to learn py-tetrad as well. I converted to float as suggested, and I am now getting some runtime warnings, as below:

(image: screenshot of the runtime warnings)

I am not sure what double scalars are.

MarkDana commented 8 months ago

This seems to be a division by zero warning. Could you have a look at the raw data and check whether there are constant variables (e.g., some column is True in all samples)?
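
One way to run that check is sketched below, on a small made-up array rather than the reporter's data. A constant column has zero standard deviation, so any step that divides by it would divide by zero. Whether that is the exact source of the warning here is an assumption.

```python
import numpy as np

# Hypothetical numeric sample; column 1 holds the same value in every row.
data = np.array([[12066.0, 1.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [17581.0, 1.0, 1.0]])

# A column is constant when its maximum equals its minimum.
constant_mask = data.max(axis=0) == data.min(axis=0)
constant_idx = np.flatnonzero(constant_mask).tolist()

# The standard deviation of a constant column is zero.
stds = data.std(axis=0)

print(constant_idx)
```

The indices in `constant_idx` are the columns to inspect or drop before running the test.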

priamai commented 8 months ago

Hello, you are damn right. Some of the variables are static, so I should remove them. Here is the number of unique values per column:

IMPRESSIONS     378
CLICKS          313
CONVERSIONS      62
AD_SPEND_USD    375
IS_MARKETING      1
REVENUE_USD     411
NUM_ORDERS       92
AOV_USD         409
IS_REVENUE        1
VISITORS        388
SESSIONS        431
IS_TRAFFIC        1
IS_WEEKEND        2
IS_CAMPAIGN       2
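
A minimal sketch of that cleanup, assuming the data lives in a pandas DataFrame named X as in the original snippet. The frame below is a made-up stand-in that reuses three of the column names; it drops every column with a single unique value and casts the rest to float.

```python
import pandas as pd

# Hypothetical stand-in for X; IS_MARKETING is constant, like in the counts above.
X = pd.DataFrame({
    "CLICKS": [333.0, 0.0, 308.0, 33.0],
    "IS_MARKETING": [True, True, True, True],
    "IS_CAMPAIGN": [False, False, False, True],
})

# Keep only the columns that take more than one distinct value.
n_unique = X.nunique()
keep = n_unique[n_unique > 1].index

# Cast to float so the resulting array is numeric rather than dtype=object.
dataset = X[keep].astype(float).to_numpy()

print(list(keep), dataset.dtype, dataset.shape)
```

The resulting `dataset` could then be passed to `pc(dataset, alpha=0.05, indep_test='kci')` as in the original post.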