Documentation on parameters of CMIsymb is lacking

shaypal5 commented 5 years ago

Hey there, Jakob and tigramite contributors! :)

I'm planning to use PCMCI for causal inference on discrete data. Since I don't want to assume any model for causality flow, I chose the non-parametric conditional mutual information test. Thus, the discrete implementation for it, the CMIsymb class, seems like the way to go.

However, CMIsymb has several parameters whose effect on the test is not clear from the class documentation. This is true even after reading the paper on PCMCI, specifically sections S2.3 and S2.4, which deal with the CMI test and its discrete variant, respectively. Extending the documentation would help users make better use of the test and choose the right parameters for their use case.

Personally, I would also love to hear some more about these just for my own use. Also, if you'll give me something to start with I can read up on it and will happily open a PR with the extended documentation.

n_symbs - I get the basic idea of assuming symbolic input and why the default is the maximum value seen in the data (plus 1), but what are the effects of giving a smaller or larger number? Is there any reason to do so? If so, is there a way to estimate that number, in those cases?
significance - Not sure what fixed_thres does.
sig_blocklength and conf_blocklength - I can see that the documentation of the abstract base class, CondIndTest, refers to the paper, and that section S3.3 does go into detail about this, but I think some more detail in the documentation itself would help.

I promise to read up on the sections I mentioned and, if I find my answers there, to try to extend the documentation myself, but I would also love to get your help with this.

jakobrunge commented 5 years ago

Dear Shay

I will add more documentation on this soon. But actually, there aren't that many parameters.

n_symbs should in most cases be determined by the input data, i.e., set n_symbs=None. In some applications, you can choose how the data is discretized. The estimator is just based on counting the joint probabilities of symbols occurring in the time series. The more symbols you have, the higher-dimensional the data and the more difficult the estimation in PCMCI gets.
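To make the counting idea concrete, here is a minimal plug-in sketch of a conditional mutual information estimate for symbolic data. It is an illustration in plain Python, not tigramite's implementation; the function name `cmi_symbolic` and the toy sequences are hypothetical.

```python
from collections import Counter
from math import log


def cmi_symbolic(x, y, z):
    """Plug-in estimate of I(X; Y | Z) in nats, from symbol counts.

    x, y, z are equal-length sequences of hashable symbols. For several
    conditioning variables, pass z as a sequence of tuples.
    """
    n = len(x)
    n_xyz = Counter(zip(x, y, z))  # joint counts of (x, y, z)
    n_xz = Counter(zip(x, z))      # marginal counts of (x, z)
    n_yz = Counter(zip(y, z))      # marginal counts of (y, z)
    n_z = Counter(z)               # marginal counts of z
    total = 0.0
    for (xi, yi, zi), c in n_xyz.items():
        # p(x,y,z) * log( p(x,y,z) p(z) / (p(x,z) p(y,z)) ), written with counts
        total += (c / n) * log(c * n_z[zi] / (n_xz[(xi, zi)] * n_yz[(yi, zi)]))
    return total


# Hypothetical toy data: y copies x, and z is constant.
x = [0, 1, 0, 1]
y = [0, 1, 0, 1]
z = [0, 0, 0, 0]
print(cmi_symbolic(x, y, z))  # equals the entropy of x here, i.e. log(2)
print(cmi_symbolic(x, y, x))  # conditioning on x itself removes the dependence
```

The number of possible joint cells grows as n_symbs ** (2 + number of conditioning variables), so with many symbols most cells hold few or no samples, which is why estimation gets harder.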

significance: This is inherited from the conditional independence base class. I recommend 'shuffle_test', since 'fixed_thres' just applies a fixed threshold to the CMI test statistic. That is faster, but you would have to adapt your threshold yourself and cannot guarantee a specified significance level.
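The difference between the two options can be sketched as follows. This is a simplified illustration in plain Python, not tigramite's code: the helper names, the threshold value, and the choice to permute x only among samples that share the same z value are all assumptions of this sketch.

```python
import random
from collections import Counter
from math import log


def cmi(x, y, z):
    """Plug-in estimate of I(X; Y | Z) in nats from symbol counts."""
    n = len(x)
    n_xyz, n_xz = Counter(zip(x, y, z)), Counter(zip(x, z))
    n_yz, n_z = Counter(zip(y, z)), Counter(z)
    return sum(
        (c / n) * log(c * n_z[zi] / (n_xz[(xi, zi)] * n_yz[(yi, zi)]))
        for (xi, yi, zi), c in n_xyz.items()
    )


def permute_within_strata(x, z, rng):
    """Shuffle x only among samples with identical z (assumption of this sketch)."""
    groups = {}
    for i, zi in enumerate(z):
        groups.setdefault(zi, []).append(i)
    x_perm = list(x)
    for idx in groups.values():
        vals = [x[i] for i in idx]
        rng.shuffle(vals)
        for i, v in zip(idx, vals):
            x_perm[i] = v
    return x_perm


def shuffle_pvalue(x, y, z, n_shuffles=200, seed=0):
    """Permutation p-value: share of shuffled statistics >= the observed one."""
    rng = random.Random(seed)
    observed = cmi(x, y, z)
    exceed = sum(
        cmi(permute_within_strata(x, z, rng), y, z) >= observed
        for _ in range(n_shuffles)
    )
    # The +1 terms count the observed value itself, so p is never exactly 0.
    return observed, (exceed + 1) / (n_shuffles + 1)


# Hypothetical toy data: y copies x, z is constant.
x = [0, 1] * 20
y = list(x)
z = [0] * 40

observed, p = shuffle_pvalue(x, y, z)
reject_by_shuffle = p < 0.05              # controls the level at 0.05
reject_by_fixed_threshold = observed > 0.1  # 0.1 is arbitrary: no level guarantee
```

The fixed-threshold decision needs a single CMI evaluation, while the shuffle test needs `n_shuffles + 1` of them; that is the speed trade-off mentioned above.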

Also, the blocklength parameters belong to the base class and are helpful for autocorrelated data to better simulate the shuffled null distribution. The base class chooses them optimally for sig_blocklength=None.
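As a rough illustration of what a block length does, the sketch below shuffles contiguous blocks of a series instead of individual samples, so that short-range autocorrelation inside each block survives in the surrogate. The function `block_shuffle` and its handling of leftover samples (kept at the end) are assumptions of this sketch, not the base class's actual procedure.

```python
import random


def block_shuffle(x, blocklength, rng):
    """Permute contiguous blocks of x, keeping the order inside each block.

    Samples that do not fill a whole block are kept at the end, which is a
    simplification chosen for this sketch.
    """
    n_blocks = len(x) // blocklength
    blocks = [x[i * blocklength:(i + 1) * blocklength] for i in range(n_blocks)]
    tail = list(x[n_blocks * blocklength:])
    rng.shuffle(blocks)  # only the order of the blocks changes
    return [v for block in blocks for v in block] + tail


rng = random.Random(0)
series = list(range(10))          # hypothetical toy series
surrogate = block_shuffle(series, blocklength=3, rng=rng)
print(surrogate)                  # runs of three consecutive values stay together
```

With `blocklength=1` this reduces to an ordinary sample-wise shuffle, which destroys all autocorrelation; longer blocks preserve more of it but leave fewer distinct surrogates.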

I am actually working on less general independence tests for discrete data that will yield more power (if the data is not nonlinear).

shaypal5 commented 5 years ago

I never answered you here, but I actually read your reply just a bit after you wrote it and found it very helpful, so thank you very much! :)

jakobrunge / tigramite

Documentation on parameters of CMIsymb is lacking #31