Open muoten opened 3 years ago
Hi,
`ffactor` and `maxdev` control the discretization (quantization) process: `ffactor` controls the resolution and `maxdev` the outliers. `minc` is a threshold on the minimum number of samples for a given discrete x value or label; an x value that appears fewer than `minc` times is ignored.
If you have a very large number of samples you can try increasing `ffactor` and `maxdev` to improve the discretization, and increasing `minc` to ignore rare x values.
Regards Jose Fonollosa
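To make those roles concrete, here is a toy sketch (my own illustration, not the repo's code) of how `ffactor` and `maxdev` bound the number of discrete labels to at most `2*ffactor*maxdev + 1`, and how `minc` then acts as a frequency filter on those labels:

```python
import numpy as np
from collections import Counter

def toy_discretize(x, ffactor, maxdev):
    # Standardize, round to a grid of step 1/ffactor, clip at +/- maxdev:
    # at most 2*ffactor*maxdev + 1 distinct integer labels can survive.
    z = (x - np.mean(x)) / np.std(x)
    return np.clip(np.round(z * ffactor), -ffactor * maxdev, ffactor * maxdev)

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
labels = toy_discretize(x, ffactor=3, maxdev=3)
print(len(np.unique(labels)))  # at most 2*3*3 + 1 = 19 labels

# minc drops labels that occur too rarely to estimate a conditional reliably:
minc = 12
counts = Counter(labels)
kept = labels[np.array([counts[v] >= minc for v in labels])]
print(len(np.unique(kept)))  # rare labels (fewer than minc samples) removed
```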
On 10 Nov 2020, at 13:40, muoten notifications@github.com wrote:
I found the conditional_distribution_similarity implementation via https://github.com/FenTechSolutions/CausalDiscoveryToolbox/blob/master/cdt/causality/pairwise/CDS.py. The method seems very interesting (and promising). Anyway, although I've read https://arxiv.org/pdf/1601.06680.pdf, I don't understand the reasons behind the default values for ffactor, maxdev, and minc. Could you provide a brief explanation and suggested ranges or alternatives for these parameters?
Thanks!
Thank you very much for your response!
According to the `discretized_sequence` method, it seems a normal distribution is assumed for numerical variables, or at least they are standardized (to have zero mean and unit variance) as if they were normal. Copied from https://github.com/jarfo/cause-effect/blob/master/features.py#L89:
```python
def discretized_sequence(x, tx, ffactor, maxdev, norm=True):
    if not norm or (numerical(tx) and count_unique(x) > len_discretized_values(x, tx, ffactor, maxdev)):
        if norm:
            x = (x - np.mean(x)) / np.std(x)
            xf = x[abs(x) < maxdev]
            x = (x - np.mean(xf)) / np.std(xf)
        x = np.round(x * ffactor)
        vmax = ffactor * maxdev
        vmin = -ffactor * maxdev
        x[x > vmax] = vmax
        x[x < vmin] = vmin
    return x
```
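For a skewed, long-tailed sample, the `maxdev` clipping in that snippet can saturate a noticeable fraction of points at the boundary labels. A toy illustration of just the standardize/round/clip steps (my own code, following the same logic, on a log-normal sample):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(sigma=1.0, size=5000)   # skewed, long right tail

ffactor, maxdev = 2, 3
z = (x - x.mean()) / x.std()
xf = z[np.abs(z) < maxdev]                # inliers only
z = (z - xf.mean()) / xf.std()            # re-standardize on the inliers
q = np.clip(np.round(z * ffactor), -ffactor * maxdev, ffactor * maxdev)

sat = (np.abs(q) == ffactor * maxdev).mean()
print(f"{100 * sat:.1f}% of samples clipped to the boundary labels")
```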
`maxdev=3` would be very strict for non-normal data. What would be your advice for skewed, long-tailed distributions? Would it make sense to set `ffactor` to a large integer and `minc=0` to avoid discretization, i.e. to calculate the conditional similarity score of two sequences sample by sample, when computationally feasible?

Regards
In relation to my previous questions, here is a sensitivity analysis for a particular example. For convenience I used the `conditional_distribution_similarity` implementation from the `cdt` package:
Ground truth for http://webdav.tuebingen.mpg.de/cause-effect/pair0062: y-->x
```python
import pandas as pd
from cdt.causality.pairwise import CDS

URL = "http://webdav.tuebingen.mpg.de/cause-effect/pair0062"
df_xy = pd.read_csv('{}.txt'.format(URL), sep=r'\s+', names=['x', 'y'], index_col=False)

def test_pairwise_cds_score(x, y, model):
    return model.cds_score(x, y) < model.cds_score(y, x)

model1 = "CDS()"
print("y->x as {} (with {})".format(test_pairwise_cds_score(df_xy['y'], df_xy['x'], eval(model1)), model1))
# prints: y->x as False (with CDS())

model2 = "CDS(ffactor=2, maxdev=3, minc=1)"
print("y->x as {} (with {})".format(test_pairwise_cds_score(df_xy['y'], df_xy['x'], eval(model2)), model2))
# prints: y->x as True (with CDS(ffactor=2, maxdev=3, minc=1))
```
So the CDS estimate for y->x changes from False to True if we reduce the default `minc=12` to `minc=1`.
Moreover, the quantization error between the discretized filtered sequences and the scaled originals (sMAPE) decreases from 15.6% (`{'ffactor': 2, 'maxdev': 3, 'minc': 12}`) to 12.8% (`{'ffactor': 2, 'maxdev': 3, 'minc': 1}`), while the fraction of filtered (ignored) values decreases from 6% (`y`) and 12% (`x`) to 0.
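For reference, the quantization error above is a symmetric mean absolute percentage error between the standardized sequence and its quantized/de-quantized version; a minimal version of the metric (my own helper, with the usual sMAPE definition assumed, not code from `cdt`):

```python
import numpy as np

def smape(a, b, eps=1e-12):
    # Symmetric mean absolute percentage error, in percent.
    return 100.0 * np.mean(2.0 * np.abs(a - b) / (np.abs(a) + np.abs(b) + eps))

# Error between a standardized sample and its quantized/de-quantized version:
rng = np.random.default_rng(0)
z = rng.normal(size=5000)
z = (z - z.mean()) / z.std()
ffactor, maxdev = 2, 3
q = np.clip(np.round(z * ffactor), -ffactor * maxdev, ffactor * maxdev) / ffactor
print(f"sMAPE: {smape(z, q):.1f}%")
```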
Visualizing a meshgrid of estimates for different values of `ffactor` and `maxdev`, with `minc=1`, y->x comes out True in 21/30 = 70% of the settings.
And several inferences of y->x as False correspond to relatively higher discretization errors:
I think it would be great if the CDS implementation could incorporate some kind of auto-tune mode that, for a given pair of variables, finds a more suitable combination of parameters; or at least warns if the discretization error or the fraction of discarded samples exceeds a threshold.
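Such an auto-tune could be as simple as sweeping a small grid and keeping the setting that minimizes the quantization error, with a warning above a threshold. A hypothetical sketch (the parameter grid, the sMAPE-style error, and the threshold are all my assumptions, not anything in `cdt`):

```python
import itertools
import numpy as np

def quant_error(x, ffactor, maxdev):
    # sMAPE-style error of the standardize/round/clip quantization pipeline.
    z = (x - x.mean()) / x.std()
    q = np.clip(np.round(z * ffactor), -ffactor * maxdev, ffactor * maxdev) / ffactor
    return np.mean(2.0 * np.abs(z - q) / (np.abs(z) + np.abs(q) + 1e-12))

def auto_tune(x, ffactors=(2, 3, 5, 8), maxdevs=(3, 4, 5), max_error=0.2):
    # Pick the grid point with the lowest quantization error; warn if still high.
    best = min(itertools.product(ffactors, maxdevs),
               key=lambda p: quant_error(x, *p))
    if quant_error(x, *best) > max_error:
        print("warning: quantization error above threshold")
    return best

rng = np.random.default_rng(0)
x = rng.lognormal(sigma=1.0, size=5000)   # skewed example input
ffactor, maxdev = auto_tune(x)
print(ffactor, maxdev)
```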
More details in this notebook
Thanks!