jakobrunge / tigramite

Tigramite is a python package for causal inference with a focus on time series data. The Tigramite documentation is at
https://jakobrunge.github.io/tigramite/
GNU General Public License v3.0

Integrate CDT in Tigramite #132

Closed. yellowsloth closed this issue 3 years ago

yellowsloth commented 3 years ago

Dear Jakob,

First of all, I would like to thank you for your work. I find your effort to make Tigramite usable for the whole community admirable.

I would like to integrate the causal discovery algorithms offered by the Causal Discovery Toolbox (CDT [https://github.com/FenTechSolutions/CausalDiscoveryToolbox]) into Tigramite, so that users can choose not only the independence test but also the causal skeleton discovery algorithm. This would help in building benchmarks for the community. I have two aims:

  1. Extend the algorithms for causal discovery proposed in the CDT to time-series while keeping the Tigramite framework.
  2. Use the MCI step on the output of the CDT algorithms.

To add an external independence test I can easily adjust the function "run_test" in the file "independence_tests_base.py". Replacing the pc_stable algorithm with FCI (for example), by contrast, requires more caution, so I was looking for some confirmation.

Let me try to illustrate what I have in mind for points 1) and 2). Given a set of time series {A, B, C, D} of length N = 1000 points and TauMax = 20:

  1. Create an array of time series {A, A', A'', ..., A^20, B, B', B'', ..., B^20, ..., D, D', D'', ..., D^20}, where " ' " denotes a copy of the series shifted in time by 1, 2, 3, ..., 20 steps. The resulting array will have shape [N_variables*TauMax, N]. Run the CDT algorithms on these shifted time series and extract the causal dependencies between variables and shifted variables. Convert the outcome into a dictionary of the form {0: [(3, -2), ...], 1: [], ...}. For example, if the GES algorithm finds a link between variable 0 and variable 3 shifted by 2, I can create a dictionary with the entry 0: [(3, -2)] as a meaningful link. I can also set dummy values for the p-value and confidence index, so that Tigramite can still be used for its plotting functions and other methods.
  2. The idea would be to pass the dictionary extracted in step 1 to the "run_mci" function. However, you can't run MCI without the real confidence indexes, p-values and conditioning sets. I'm also not sure it makes sense to compute the MCI step on the output of an algorithm that is not constraint-based, such as GES. I think the only way to do this is to run PC again using only the links selected in point 1 and then run MCI on the output of PC: Create matrix -> GES -> get dictionary of selected links -> run PCMCI with selected links.

Thanks for your advice.
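The lag-embedding and dictionary conversion described in point 1 could be sketched roughly as follows. This is a minimal illustration in plain Python, not code from Tigramite or CDT: the helper names `lag_embed` and `edges_to_link_dict` are hypothetical, the edge list stands in for whatever a CDT algorithm would return, and keeping only edges that point into an unshifted (lag-0) column is an assumption made here for simplicity. In this sketch lag 0 is included and all rows are aligned on common time points, so the array has N_variables*(TauMax+1) rows and N-TauMax columns.

```python
def lag_embed(series, tau_max):
    """Stack every variable at lags 0..tau_max (hypothetical helper).

    series: list of equal-length sequences, one per variable.
    Returns (rows, labels), where rows[k] holds variable j at lag tau
    and labels[k] == (j, -tau).
    """
    n = len(series[0])
    rows, labels = [], []
    for j, x in enumerate(series):
        for tau in range(tau_max + 1):
            # Align all rows on the same time points t = tau_max .. n-1.
            rows.append(list(x[tau_max - tau : n - tau]))
            labels.append((j, -tau))
    return rows, labels


def edges_to_link_dict(edges, labels, n_vars):
    """Turn directed edges between columns into {j: [(i, -tau), ...]}.

    edges: iterable of (source_column, target_column) index pairs.
    Assumption: only edges pointing into a lag-0 column are kept.
    """
    links = {j: [] for j in range(n_vars)}
    for src, dst in edges:
        (i, lag_i), (j, lag_j) = labels[src], labels[dst]
        if lag_j == 0 and (i, lag_i) not in links[j]:
            links[j].append((i, lag_i))
    return links


# Toy example: 4 variables, 6 time points, tau_max = 2.
series = [[10 * j + t for t in range(6)] for j in range(4)]
rows, labels = lag_embed(series, tau_max=2)

# Made-up algorithm output: variable 3 at lag 2 -> variable 0 at lag 0.
edges = [(labels.index((3, -2)), labels.index((0, 0)))]
links = edges_to_link_dict(edges, labels, n_vars=4)
```

The `rows` list can be converted to an array and handed to an external algorithm, while `labels` maps each column index back to a (variable, lag) pair.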

jakobrunge commented 3 years ago

I don't quite understand the goal of your proposal. What algorithms in CDT do you have in mind that are not already in Tigramite? Note that LPCMCI is a better algorithm for latent causal discovery than FCI and is part of Tigramite already.

Ad 1) GES would indeed be an idea. I have no idea how GES performs on autocorrelated time series and I don't know whether your suggestion in point 1 would work regarding stationarity assumptions etc.

Ad 2) In any case, if GES is adapted to time series and contemporaneous links are ignored, then GES gives a DAG and MCI can readily be applied by just using the parents from GES as lagged_parents (no p-values needed). In other cases it's not straightforward to me. For PCMCI+ the contemporaneous skeleton phase only requires lagged (preliminary) parents and not confidence values, so that could similarly be used I'd say.
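A minimal sketch of the "Ad 2" suggestion, assuming the external algorithm's output is already available as a dictionary of the form {j: [(i, -tau), ...]}: contemporaneous links (lag 0) are dropped and only lagged parents within the chosen maximum lag are kept. The helper name `lagged_parents_only` and the example links are hypothetical. The resulting dictionary has the shape that Tigramite uses for parents; passing it as the `parents` argument of `PCMCI.run_mci` is one reading of the reply above and has not been tested here.

```python
def lagged_parents_only(links, tau_max):
    """Keep only lagged parents with 1 <= lag <= tau_max (hypothetical helper).

    links: dict mapping variable j to a list of (i, tau) pairs, tau <= 0.
    Returns a dict with an entry for every variable in links.
    """
    parents = {}
    for j, candidates in links.items():
        # Drop contemporaneous links (tau == 0) and lags beyond tau_max;
        # the set also removes duplicates.
        kept = {(i, tau) for (i, tau) in candidates if -tau_max <= tau < 0}
        # Order by increasing lag, then by variable index, for readability.
        parents[j] = sorted(kept, key=lambda p: (-p[1], p[0]))
    return parents


# Made-up links, e.g. from a GES run on lag-embedded data.
ges_links = {
    0: [(3, -2), (1, 0), (0, -1)],  # (1, 0) is contemporaneous
    1: [(0, -25)],                  # beyond tau_max
    2: [],
}
parents = lagged_parents_only(ges_links, tau_max=20)
```

Variables without any remaining parent keep an empty list, so every variable still has a key in the dictionary.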

yellowsloth commented 3 years ago

Thanks for your quick reply. I had not looked at the new version of the framework. As you suggested: 1) I can integrate new causal discovery algorithms just as was done for LPCMCI. 2) If MCI needs no p-values or confidence indexes, I can easily run some tests and check for improvements.

Regarding your concerns about the proposal: lately, many new causal discovery algorithms with different approaches have been proposed, but they often come without code, or their extension to time series is not clear. Keeping up with new algorithms, understanding their assumptions, and implementing and integrating them is a long process. It is also difficult to have enough time and competence to do everything, and to do it correctly. Since I am using causal discovery for feature selection, I was interested in trying more algorithms with different approaches and letting prediction accuracy decide which model is best. Extending Tigramite into a framework that collects algorithms would allow up-to-date algorithms, fast integration and a more reliable benchmark. As you noted, applying many of the proposed algorithms is not trivial and requires a more in-depth study of the methods. In general, I would like to compare the following algorithms: Granger, ts-Lingam, GES, CD-NOD, GOLEM-NV, MPIR, as well as the winning methods of the Causality 4 Climate contest.

Over the next weeks or months I'm going to make some attempts. Then, if you think it might be useful and I'm satisfied with the work done, I might try to provide an interface for communicating with the Tigramite framework. Thanks again.