jakobrunge / tigramite

Tigramite is a Python package for causal inference with a focus on time series data. The Tigramite documentation is at
https://jakobrunge.github.io/tigramite/
GNU General Public License v3.0

Documentation on parameters of CMIsymb is lacking #31

Closed: shaypal5 closed this issue 5 years ago

shaypal5 commented 5 years ago

Hey there, Jakob and tigramite contributors! :)

I'm planning to use PCMCI for causal inference on discrete data. Since I don't want to assume any model for causality flow, I chose the non-parametric conditional mutual information test. Thus, the discrete implementation for it, the CMIsymb class, seems like the way to go.

However, CMIsymb has several parameters whose effect on the test is not clear from the class documentation. This is true even after reading the paper on PCMCI, specifically sections S2.3 and S2.4, which deal with the CMI test and its discrete variant, respectively. Extending the documentation would help users make better use of the test and choose the right parameters for their use case.

Personally, I would also love to hear more about these for my own use. Also, if you give me something to start with, I can read up on it and will happily open a PR with the extended documentation.

I promise to read up on the sections I mentioned and, if I find my answers there, to try to extend the documentation myself, but I would also love to get your help with this.

jakobrunge commented 5 years ago

Dear Shay

I will add more documentation on this soon. But actually, there aren't that many parameters.

n_symbs should in most cases be determined by the input data, i.e., set n_symbs=None. In some applications, you can choose how the data is discretized. The estimator is simply based on counting the joint probabilities of symbols occurring in the time series. The more symbols you have, the higher-dimensional the data and the more difficult the estimation in PCMCI gets.
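To make the counting idea concrete, here is a minimal sketch of a plug-in estimate of the conditional mutual information I(X;Y|Z) from symbol counts. It is not tigramite's implementation; the function name `cmi_plugin` and the use of plain Python sequences are assumptions made for illustration only.

```python
from collections import Counter
from math import log


def cmi_plugin(x, y, z):
    """Plug-in estimate of I(X;Y|Z) in nats, using relative frequencies.

    x, y, z are equal-length sequences of hashable symbols. To condition
    on several variables, pass tuples as the entries of z.
    Hypothetical helper for illustration, not part of tigramite.
    """
    n = len(x)
    n_xyz = Counter(zip(x, y, z))  # joint counts of (x, y, z)
    n_xz = Counter(zip(x, z))      # marginal counts of (x, z)
    n_yz = Counter(zip(y, z))      # marginal counts of (y, z)
    n_z = Counter(z)               # marginal counts of z

    cmi = 0.0
    for (xi, yi, zi), c in n_xyz.items():
        # p(x,y,z) * log( p(x,y,z) p(z) / (p(x,z) p(y,z)) ); the 1/n factors cancel
        cmi += (c / n) * log(c * n_z[zi] / (n_xz[(xi, zi)] * n_yz[(yi, zi)]))
    return cmi


# Y is an exact copy of a balanced binary X, Z is constant: CMI equals log(2).
x = [0, 1] * 50
z = [0] * 100
dependent = cmi_plugin(x, list(x), z)

# X and Y take every combination equally often: CMI is zero.
independent = cmi_plugin([0, 0, 1, 1] * 25, [0, 1, 0, 1] * 25, z)
```

The number of distinct (x, y, z) cells grows multiplicatively with the number of symbols per variable, so with more symbols each cell receives fewer samples and the estimate becomes less reliable.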

significance: This is inherited from the conditional independence base class. I recommend 'shuffle_test', since 'fixed_thres' just applies a fixed threshold to the CMI test statistic. That is faster, but you would have to adapt the threshold yourself and cannot guarantee a specified significance level.
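The difference between the two options can be sketched as follows. This is a simplified illustration, not tigramite's code: the helper names `cmi` and `shuffle_pvalue`, the number of shuffles, and the (k+1)/(n+1) p-value convention are all assumptions. A fixed threshold would only compare the observed value with a constant, whereas the shuffle test compares it with values obtained after destroying the X-Y association.

```python
import random
from collections import Counter
from math import log


def cmi(x, y, z):
    """Plug-in estimate of I(X;Y|Z) in nats from symbol counts (illustrative)."""
    n = len(x)
    n_xyz, n_z = Counter(zip(x, y, z)), Counter(z)
    n_xz, n_yz = Counter(zip(x, z)), Counter(zip(y, z))
    return sum(
        (c / n) * log(c * n_z[zi] / (n_xz[(xi, zi)] * n_yz[(yi, zi)]))
        for (xi, yi, zi), c in n_xyz.items()
    )


def shuffle_pvalue(x, y, z, n_shuffles=200, seed=0):
    """Permutation p-value: how often does a shuffled X reach the observed CMI?

    Simplification: X is shuffled globally, which also breaks any X-Z
    dependence; a real implementation may shuffle more carefully.
    """
    rng = random.Random(seed)
    observed = cmi(x, y, z)
    x_perm = list(x)
    exceed = 0
    for _ in range(n_shuffles):
        rng.shuffle(x_perm)  # destroy the pairing between X and (Y, Z)
        if cmi(x_perm, y, z) >= observed:
            exceed += 1
    # Add-one convention keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_shuffles + 1)


x = [0, 1] * 100
observed, pval = shuffle_pvalue(x, list(x), [0] * 200)
reject = pval < 0.05  # decision at a chosen significance level
```

Because the null distribution is re-estimated from the data at hand, the same significance level can be used regardless of sample size or number of symbols, which a single fixed threshold on the CMI value cannot offer.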

Also, the blocklength parameters belong to the base class and are helpful for autocorrelated data, to better simulate the shuffled null distribution. The base class chooses them optimally when sig_blocklength=None.
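As a rough illustration of why a block length matters, the sketch below shuffles a series in contiguous blocks instead of single samples, so that short-range autocorrelation inside each block survives in the surrogate. The function `block_shuffle` is hypothetical, and appending the leftover samples at the end is a simplification; tigramite's own handling may differ.

```python
import random


def block_shuffle(series, blocklength, rng):
    """Return a surrogate of `series` with contiguous blocks permuted.

    Samples within a block keep their order, so autocorrelation up to
    roughly `blocklength` lags is preserved. Leftover samples that do not
    fill a whole block are appended unchanged (a simplifying assumption).
    """
    n_blocks = len(series) // blocklength
    blocks = [
        list(series[i * blocklength:(i + 1) * blocklength])
        for i in range(n_blocks)
    ]
    tail = list(series[n_blocks * blocklength:])
    rng.shuffle(blocks)  # permute whole blocks, not individual samples
    return [value for block in blocks for value in block] + tail


rng = random.Random(0)
surrogate = block_shuffle(list(range(10)), blocklength=3, rng=rng)
```

With `blocklength=1` this reduces to an ordinary sample-wise shuffle; larger blocks keep more of the serial structure at the cost of fewer distinct permutations.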

I am actually working on less general independence tests for discrete data that will yield more power (if the data is not nonlinear).

shaypal5 commented 5 years ago

I never answered you here, but I actually read your reply just a bit after you wrote it and found it very helpful, so thank you very much! :)