vnmabus / dcor

Distance correlation and related E-statistics in Python
https://dcor.readthedocs.io
MIT License

Range of distance correlation #64

Closed: AbhiPawar5 closed this issue 6 months ago

AbhiPawar5 commented 6 months ago

Bug description summary

Hi team, thank you for creating and maintaining dcor. I have found it extremely easy to use.

My question: what is the range of the distance correlation? I see values above 1.0.

Code to reproduce the bug

from itertools import combinations

import dcor


def distance_correlation(df):
    # Filter df to select numerical columns only
    numerical_cols = df.select_dtypes(include=['number']).columns
    correlations = {}

    # Compute the distance correlation for every pair of numerical columns
    for col1, col2 in combinations(numerical_cols, 2):
        dist_corr = dcor.distance_correlation(df[col1].values, df[col2].values)
        correlations[(col1, col2)] = dist_corr

    return correlations

Expected result

The correlation should range between 0 and 1.

Actual result

When I pass my dataframe, one of the dcor values comes out as 1.393317.

Traceback (if an exception is raised)

No response

Software versions

dcor==0.6, macOS 12.5.1

Additional context

No response

vnmabus commented 6 months ago

That should definitely not happen. Are you able to provide a minimal reproducible example?

AbhiPawar5 commented 6 months ago

Hi @vnmabus, I wish I could, but it is one of our internal demo datasets. Sorry. What I can share is that this correlation of 1.39 occurs between CUSTOMER_ID and ACCOUNT_NUMBER, which have a unique 1:1 mapping.

vnmabus commented 6 months ago

In that case, the dcor should be 1. How many samples do you have? Can you reproduce the error by altering the dataset? What happens if you reduce the number of samples or change their values? Can you find a shareable set in which the problem is still present?

AbhiPawar5 commented 6 months ago

Here's the dataset attached. Can you please take a look?

DEMO1.profiles_data.sample_data copy.csv

vnmabus commented 6 months ago

I cannot reproduce it in the current develop version:

import numpy as np
import dcor

a = np.loadtxt("DEMO1.profiles_data.sample_data.copy.csv", skiprows=1, delimiter=",", usecols=(0, 1))

dcor.distance_correlation(a[:, 0], a[:, 1])
0.03820843330489932

Please check whether this code gives the same result on your machine, with your installed dcor version. If not, please also try the develop version. Maybe it is a bug that has been fixed but not yet released.

AbhiPawar5 commented 6 months ago

Yes, I see a result similar to yours: 0.0382084333048913. But if you do

a = np.loadtxt("DEMO1.profiles_data.sample_data.copy.csv", skiprows=1, delimiter=",", usecols=(0, 1))
dcor.distance_correlation(df['ACCOUNTNUMBER'].values, df['CUSTOMERID'].values)

you will see 1.3933173158759997.

Does it have something to do with passing a dataframe?

vnmabus commented 6 months ago

You are not really passing a dataframe (you are calling values, obtaining the internal NumPy array instead). Could you compare the values of (df['ACCOUNTNUMBER'].values, df['CUSTOMERID'].values) with the ones in (a[:, 0], a[:, 1])? Maybe the dtype differs, or something like that?
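A quick way to run the comparison suggested above is to print the dtype of each array. The sketch below uses a small made-up CSV (the column names and values are placeholders, not the attached dataset) to show that np.loadtxt returns float64 by default, while an integer cast keeps the same values under a different dtype.

```python
import io

import numpy as np

# Hypothetical two-column CSV standing in for the attached file.
csv_text = "CUSTOMERID,ACCOUNTNUMBER\n1001,5001\n1002,5002\n1003,5003\n"

# np.loadtxt parses every field as float unless told otherwise.
as_float = np.loadtxt(
    io.StringIO(csv_text), skiprows=1, delimiter=",", usecols=(0, 1)
)

# The same values stored with an integer dtype.
as_int = as_float.astype(int)

print(as_float.dtype)                    # float64
print(as_int.dtype)                      # an integer dtype (platform dependent)
print(np.array_equal(as_float, as_int))  # the values are equal; only the dtype differs
```

If the two arrays hold equal values but report different dtypes, the dtype is the likely cause of any difference in the result.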

AbhiPawar5 commented 6 months ago

Found the issue. We get different results depending on the data type of the array elements. If you add the following line to the code you shared above, you get a dcor greater than 1.0.

a = a.astype(int)

vnmabus commented 6 months ago

Then it may be related to #59. That is fixed (at least partially) in the develop version, and indeed it works on my machine. In any case, it is better to cast them to float, as that is the optimized path.
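As an illustration of the float cast, here is a minimal sketch that converts both inputs to float64 before doing any arithmetic. The function naive_distance_correlation is a hypothetical O(n^2) helper written for this example. It is not dcor's implementation and makes no claim about how dcor computes the statistic internally.

```python
import numpy as np


def naive_distance_correlation(x, y):
    """Naive O(n^2) sample distance correlation (illustrative helper only)."""
    # Cast first, so every later step runs in floating point.
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)

    def centered_distances(v):
        # Pairwise absolute differences, double-centered.
        d = np.abs(v[:, None] - v[None, :])
        return (
            d
            - d.mean(axis=0, keepdims=True)
            - d.mean(axis=1, keepdims=True)
            + d.mean()
        )

    a = centered_distances(x)
    b = centered_distances(y)
    dcov2 = (a * b).mean()
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0:
        return 0.0
    # Clamp tiny negative values caused by rounding before the square root.
    return float(np.sqrt(max(dcov2 / denom, 0.0)))


# Large integer IDs (made-up values) and a linear function of them.
ids = np.array(
    [10_000_000_001, 10_000_000_007, 10_000_000_002, 10_000_000_010],
    dtype=np.int64,
)
linear = 2 * ids + 1

result = naive_distance_correlation(ids, linear)
print(result)  # close to 1 for an exactly linear relationship
```

With dcor itself, the equivalent step is to pass `x.astype(float)` and `y.astype(float)` to `dcor.distance_correlation`.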

AbhiPawar5 commented 6 months ago

I understand that casting to float is the optimised path, but the results are very far apart! What I don't fully understand is how a change in data type alone can move the result from 0.038 to 1.39.

So the question is: which result should I trust, 0.038 or 1.39?

vnmabus commented 6 months ago

0.038 is the right one. The other may be caused by an overflow in the intermediate computations, which does not happen in the current develop version.
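To see why an integer dtype can produce a wrong value without any error, note that NumPy array arithmetic on fixed-width integers wraps around silently once a result exceeds the dtype's range. The sketch below uses made-up values and is not taken from dcor's code. It only shows the general effect on an intermediate product.

```python
import numpy as np

# Made-up values of the size typical for ID-like columns.
x_int = np.array([3_000_000_000, 4_000_000_000], dtype=np.int64)
x_float = x_int.astype(np.float64)

# Squaring is a stand-in for any intermediate product in a computation.
sq_int = x_int * x_int        # second element exceeds the int64 range and wraps
sq_float = x_float * x_float  # floating point keeps the correct magnitude

print(np.iinfo(np.int64).max)  # largest value an int64 can hold
print(sq_int)                  # second entry is negative after wrap-around
print(sq_float)
```

Any statistic built from such wrapped intermediates can land outside its theoretical range, which is consistent with a value like 1.39 for a quantity bounded by 1.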