jarfo / cause-effect

Kaggle cause-effect software

cds_score: to explain default values and alternatives to ffactor=2, maxdev=3, minc=12 #1

Open muoten opened 3 years ago

muoten commented 3 years ago

I found conditional_distribution_similarity implementation via https://github.com/FenTechSolutions/CausalDiscoveryToolbox/blob/master/cdt/causality/pairwise/CDS.py

The method seems very interesting (and promising). However, even though I've read https://arxiv.org/pdf/1601.06680.pdf, I don't understand the reasoning behind the default values of ffactor, maxdev, and minc. Could you provide a brief explanation and suggest ranges or alternatives for these parameters?

Thanks!

jarfo commented 3 years ago

Hi

ffactor and maxdev control the discretization (quantization) process: ffactor controls the resolution and maxdev the treatment of outliers. minc is a threshold on the minimum number of samples for a given discrete x value or label; an x value that appears fewer than minc times is ignored.

If you have a very large number of samples, you can try increasing ffactor and maxdev to improve the discretization, and minc to ignore rare x values.
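As a rough sketch of how these parameters interact (my reading of the paper and the repo, not an exact copy of its code): standardized values are rounded at resolution 1/ffactor and clipped at ±maxdev standard deviations, which gives at most 2·ffactor·maxdev + 1 distinct levels (13 with the defaults ffactor=2, maxdev=3):

```python
import numpy as np

def quantize(x, ffactor=2, maxdev=3):
    """Sketch of the discretization step: standardize, round at
    resolution 1/ffactor, clip outliers beyond maxdev std devs."""
    z = (x - np.mean(x)) / np.std(x)
    q = np.round(z * ffactor)
    return np.clip(q, -ffactor * maxdev, ffactor * maxdev)

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
q = quantize(x)
# At most 2*2*3 + 1 = 13 levels with the default parameters
print(len(np.unique(q)))
```

Increasing ffactor refines the grid, while increasing maxdev widens the range before clipping, so both only make sense when there are enough samples per level.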

Regards Jose Fonollosa


muoten commented 3 years ago

Thank you very much for your response!

Looking at the discretized_sequence method, it seems a normal distribution is assumed for numerical variables. Or at least they are standardized (to zero mean and unit variance) as if they were normal:

I copy from https://github.com/jarfo/cause-effect/blob/master/features.py#L89:

def discretized_sequence(x, tx, ffactor, maxdev, norm=True):
    # Only discretize numerical variables with more unique values than
    # the target number of discretized levels.
    if not norm or (numerical(tx) and count_unique(x) > len_discretized_values(x, tx, ffactor, maxdev)):
        if norm:
            # Standardize, then re-standardize using only the samples
            # within maxdev standard deviations (robust to outliers).
            x = (x - np.mean(x)) / np.std(x)
            xf = x[abs(x) < maxdev]
            x = (x - np.mean(xf)) / np.std(xf)
        # Round at resolution 1/ffactor and clip beyond +/- maxdev.
        x = np.round(x * ffactor)
        vmax = ffactor * maxdev
        vmin = -ffactor * maxdev
        x[x > vmax] = vmax
        x[x < vmin] = vmin
    return x
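For completeness, the minc filtering that jarfo describes (ignoring discrete x labels that occur fewer than minc times) might be sketched like this; the helper below is illustrative, not the repo's exact implementation:

```python
import numpy as np
from collections import Counter

def filter_rare_labels(xd, minc=12):
    """Drop samples whose discretized x value occurs fewer than
    minc times (illustrative version of the minc threshold)."""
    counts = Counter(xd)
    keep = np.array([counts[v] >= minc for v in xd])
    return xd[keep], keep

xd = np.array([0] * 20 + [1] * 15 + [2] * 3)   # label 2 is rare
filtered, mask = filter_rare_labels(xd, minc=12)
print(np.unique(filtered))  # label 2 removed
```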

Regards

muoten commented 3 years ago

In relation to my previous questions, here is a sensitivity analysis for a particular example. For convenience I used the conditional_distribution_similarity implementation from the cdt package:

Ground truth for http://webdav.tuebingen.mpg.de/cause-effect/pair0062: y-->x

import pandas as pd
from cdt.causality.pairwise import CDS

URL = "http://webdav.tuebingen.mpg.de/cause-effect/pair0062"
df_xy = pd.read_csv('{}.txt'.format(URL), sep=r'\s+', names=['x', 'y'], index_col=False)

def test_pairwise_cds_score(x, y, model):
    # True if the score prefers the x -> y direction for this pair
    return model.cds_score(x, y) < model.cds_score(y, x)

model1 = "CDS()"
print("y->x as {} (with {})".format(test_pairwise_cds_score(df_xy['y'], df_xy['x'], eval(model1)), model1))
# prints y->x as False (with CDS())

model2 = "CDS(ffactor=2, maxdev=3, minc=1)"
print("y->x as {} (with {})".format(test_pairwise_cds_score(df_xy['y'], df_xy['x'], eval(model2)), model2))
# prints y->x as True (with CDS(ffactor=2, maxdev=3, minc=1))

So the CDS estimate for y->x changes from False to True if we reduce the default minc=12 to minc=1.

Moreover, the quantization error between the discrete filtered sequences and the scaled originals (sMAPE) decreases from 15.6% ({'ffactor': 2, 'maxdev': 3, 'minc': 12}) to 12.8% ({'ffactor': 2, 'maxdev': 3, 'minc': 1}), while the fraction of filtered (ignored) values decreases from 6% (y) and 12% (x) to 0.
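For reference, a sMAPE of this kind between a scaled sequence and its discretized version can be computed along these lines. This is one standard sMAPE definition, shown as a sketch; the exact computation behind the figures above (what is compared to what, and how zeros are handled) is in the notebook and may differ:

```python
import numpy as np

def smape(a, b):
    """Symmetric mean absolute percentage error, as a percentage."""
    denom = np.abs(a) + np.abs(b)
    mask = denom > 0          # skip terms where both values are zero
    return 100 * np.mean(2 * np.abs(a - b)[mask] / denom[mask])

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
z = (x - x.mean()) / x.std()
xq = np.round(z * 2) / 2      # discretize at ffactor=2, back on z scale
print(round(smape(z, xq), 1))
```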

Visualizing a meshgrid of estimations for different ffactor and maxdev values, with minc=1, y->x is inferred as True in 21/30 = 70% of cases: [figure: cds_score_vs_params_minc1]

And several of the y->x inferences as False correspond to relatively high discretization errors: [figure: discretization_error_vs_params_minc1]

I think it would be great if the CDS implementation could incorporate some kind of auto-tune mode that, for a given pair of variables, finds a more suitable combination of parameters, or at least warns if the discretization error or the fraction of discarded samples exceeds a threshold.
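A minimal sketch of such an auto-tune step, assuming we simply pick the (ffactor, maxdev) pair that minimizes the quantization sMAPE over a small grid (hypothetical helpers, not part of cdt):

```python
import numpy as np
from itertools import product

def quantize(x, ffactor, maxdev):
    """Standardize and discretize; return both on the standardized scale."""
    z = (x - np.mean(x)) / np.std(x)
    q = np.clip(np.round(z * ffactor), -ffactor * maxdev, ffactor * maxdev)
    return z, q / ffactor

def smape(a, b):
    d = np.abs(a) + np.abs(b)
    m = d > 0
    return 100 * np.mean(2 * np.abs(a - b)[m] / d[m])

def autotune(x, ffactors=(2, 3, 4, 6), maxdevs=(2, 3, 4)):
    """Pick the discretization parameters with the lowest
    quantization error (one possible auto-tune criterion)."""
    return min(product(ffactors, maxdevs),
               key=lambda p: smape(*quantize(x, *p)))

rng = np.random.default_rng(2)
best = autotune(rng.normal(size=5000))
print(best)
```

Note that with an error-only criterion the finest grid always wins, so a real tuner would also need to penalize rare labels (the minc side) or the fraction of discarded samples, as suggested above.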

More details in this notebook

Thanks!