jarfo / cause-effect

Kaggle cause-effect software

cds_score: to explain default values and alternatives to ffactor=2, maxdev=3, minc=12 #1

Open muoten opened 3 years ago

muoten commented 3 years ago

I found conditional_distribution_similarity implementation via https://github.com/FenTechSolutions/CausalDiscoveryToolbox/blob/master/cdt/causality/pairwise/CDS.py

The method seems very interesting (and promising). However, even though I've read https://arxiv.org/pdf/1601.06680.pdf, I don't understand the reasoning behind the default values of ffactor, maxdev, and minc. Could you provide a brief explanation and suggest ranges or alternatives for these parameters?

Thanks!

jarfo commented 3 years ago

Hi

ffactor and maxdev control the discretization (quantization) process: ffactor controls the resolution and maxdev the treatment of outliers. minc is a threshold on the minimum number of samples for a given discrete x value or label; an x value that appears fewer than minc times is ignored.

If you have a very large number of samples, you can try increasing ffactor and maxdev to improve the discretization, and minc to ignore rare x values.
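As a rough sketch of how these parameters interact (my reading of the paper and the repo, not an exact copy of its code): standardized values are rounded at resolution 1/ffactor and clipped at ±maxdev standard deviations, which gives at most 2·ffactor·maxdev + 1 distinct levels (13 with the defaults ffactor=2, maxdev=3):

```python
import numpy as np

def quantize(x, ffactor=2, maxdev=3):
    """Sketch of the discretization step: standardize, round at
    resolution 1/ffactor, clip outliers beyond maxdev std devs."""
    z = (x - np.mean(x)) / np.std(x)
    q = np.round(z * ffactor)
    return np.clip(q, -ffactor * maxdev, ffactor * maxdev)

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
q = quantize(x)
# At most 2*2*3 + 1 = 13 levels with the default parameters
print(len(np.unique(q)))
```

Increasing ffactor refines the grid, while increasing maxdev widens the range before clipping, so both only make sense when there are enough samples per level.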

Regards Jose Fonollosa


muoten commented 3 years ago

Thank you very much for your response!

Looking at the discretized_sequence method, it seems a normal distribution is assumed for numerical variables. Or at least they are standardized (to zero mean and unit variance) as if they were normal:

I copy from https://github.com/jarfo/cause-effect/blob/master/features.py#L89:

def discretized_sequence(x, tx, ffactor, maxdev, norm=True):
    # Only discretize numerical variables with more unique values than
    # the target number of discretized levels.
    if not norm or (numerical(tx) and count_unique(x) > len_discretized_values(x, tx, ffactor, maxdev)):
        if norm:
            # Standardize, then re-standardize using only the samples
            # within maxdev standard deviations (robust to outliers).
            x = (x - np.mean(x)) / np.std(x)
            xf = x[abs(x) < maxdev]
            x = (x - np.mean(xf)) / np.std(xf)
        # Round at resolution 1/ffactor and clip beyond +/- maxdev.
        x = np.round(x * ffactor)
        vmax = ffactor * maxdev
        vmin = -ffactor * maxdev
        x[x > vmax] = vmax
        x[x < vmin] = vmin
    return x
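For completeness, the minc filtering that jarfo describes (ignoring discrete x labels that occur fewer than minc times) might be sketched like this; the helper below is illustrative, not the repo's exact implementation:

```python
import numpy as np
from collections import Counter

def filter_rare_labels(xd, minc=12):
    """Drop samples whose discretized x value occurs fewer than
    minc times (illustrative version of the minc threshold)."""
    counts = Counter(xd)
    keep = np.array([counts[v] >= minc for v in xd])
    return xd[keep], keep

xd = np.array([0] * 20 + [1] * 15 + [2] * 3)   # label 2 is rare
filtered, mask = filter_rare_labels(xd, minc=12)
print(np.unique(filtered))  # label 2 removed
```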

Regards

muoten commented 3 years ago

In relation to my previous questions, here is a sensitivity analysis for a particular example. For convenience I used the conditional_distribution_similarity implementation from the cdt package:

Ground truth for http://webdav.tuebingen.mpg.de/cause-effect/pair0062: y-->x

import pandas as pd
from cdt.causality.pairwise import CDS

URL = "http://webdav.tuebingen.mpg.de/cause-effect/pair0062"
df_xy = pd.read_csv('{}.txt'.format(URL), sep=r'\s+', names=['x', 'y'], index_col=False)

def test_pairwise_cds_score(x, y, model):
    # True if the score prefers the x -> y direction for this pair
    return model.cds_score(x, y) < model.cds_score(y, x)

model1 = "CDS()"
print("y->x as {} (with {})".format(test_pairwise_cds_score(df_xy['y'], df_xy['x'], eval(model1)), model1))
# prints y->x as False (with CDS())

model2 = "CDS(ffactor=2, maxdev=3, minc=1)"
print("y->x as {} (with {})".format(test_pairwise_cds_score(df_xy['y'], df_xy['x'], eval(model2)), model2))
# prints y->x as True (with CDS(ffactor=2, maxdev=3, minc=1))

So the CDS estimate for y->x changes from False to True if we reduce the default minc=12 to minc=1.

Moreover, the quantization error between the discrete filtered sequences and the scaled originals (sMAPE) decreases from 15.6% ({'ffactor': 2, 'maxdev': 3, 'minc': 12}) to 12.8% ({'ffactor': 2, 'maxdev': 3, 'minc': 1}), while the fraction of filtered (ignored) values decreases from 6% (y) and 12% (x) to 0.
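For reference, a sMAPE of this kind between a scaled sequence and its discretized version can be computed along these lines. This is one standard sMAPE definition, shown as a sketch; the exact computation behind the figures above (what is compared to what, and how zeros are handled) is in the notebook and may differ:

```python
import numpy as np

def smape(a, b):
    """Symmetric mean absolute percentage error, as a percentage."""
    denom = np.abs(a) + np.abs(b)
    mask = denom > 0          # skip terms where both values are zero
    return 100 * np.mean(2 * np.abs(a - b)[mask] / denom[mask])

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
z = (x - x.mean()) / x.std()
xq = np.round(z * 2) / 2      # discretize at ffactor=2, back on z scale
print(round(smape(z, xq), 1))
```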

Visualizing a meshgrid of estimations for different ffactor and maxdev values, with minc=1, y->x is inferred as True in 21/30 = 70% of cases: [figure: cds_score_vs_params_minc1]

And several of the y->x inferences as False correspond to relatively high discretization errors: [figure: discretization_error_vs_params_minc1]

I think it would be great if the CDS implementation could incorporate some kind of auto-tune mode that, for a given pair of variables, finds a more suitable combination of parameters, or at least warns if the discretization error or the fraction of discarded samples exceeds a threshold.
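A minimal sketch of such an auto-tune step, assuming we simply pick the (ffactor, maxdev) pair that minimizes the quantization sMAPE over a small grid (hypothetical helpers, not part of cdt):

```python
import numpy as np
from itertools import product

def quantize(x, ffactor, maxdev):
    """Standardize and discretize; return both on the standardized scale."""
    z = (x - np.mean(x)) / np.std(x)
    q = np.clip(np.round(z * ffactor), -ffactor * maxdev, ffactor * maxdev)
    return z, q / ffactor

def smape(a, b):
    d = np.abs(a) + np.abs(b)
    m = d > 0
    return 100 * np.mean(2 * np.abs(a - b)[m] / d[m])

def autotune(x, ffactors=(2, 3, 4, 6), maxdevs=(2, 3, 4)):
    """Pick the discretization parameters with the lowest
    quantization error (one possible auto-tune criterion)."""
    return min(product(ffactors, maxdevs),
               key=lambda p: smape(*quantize(x, *p)))

rng = np.random.default_rng(2)
best = autotune(rng.normal(size=5000))
print(best)
```

Note that with an error-only criterion the finest grid always wins, so a real tuner would also need to penalize rare labels (the minc side) or the fraction of discarded samples, as suggested above.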

More details in this notebook

Thanks!