That should definitely not happen. Are you able to provide a minimal reproducible example?
Hi @vnmabus, I wish I could, but it is one of our internal demo datasets, sorry. What I can share is that this correlation of 1.39 occurs on CUSTOMER_ID and ACCOUNT_NUMBER, which have a unique 1:1 mapping.
In that case, the dcor should be 1. How many samples do you have? Can you reproduce the error by altering the dataset? What happens if you reduce the number of samples or change their values? Can you find a shareable set in which the problem is still present?
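For context, a minimal sketch of the expected behaviour, with made-up IDs standing in for the real columns (a linear 1:1 mapping should give a distance correlation of 1):

```python
import numpy as np
import dcor

# Hypothetical stand-in for the real data: ACCOUNT_NUMBER is a linear
# 1:1 function of CUSTOMER_ID, so the distance correlation should be 1.
customer_id = np.arange(1, 1001, dtype=np.float64)
account_number = 9_000_000.0 + 7.0 * customer_id
print(dcor.distance_correlation(customer_id, account_number))  # 1.0 (up to rounding)
```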
Here's the dataset attached. Can you please take a look?
I cannot reproduce it in the current develop version:
```python
import numpy as np
import dcor

a = np.loadtxt("DEMO1.profiles_data.sample_data.copy.csv", skiprows=1, delimiter=",", usecols=(0, 1))
dcor.distance_correlation(a[:, 0], a[:, 1])
# 0.03820843330489932
```
Please check if this code gives the same result on your machine, with your installed dcor version. If not, please also try the develop version. Maybe it is just some bug that has been fixed but not yet released.
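For example, to check which release is installed (a quick sketch using only the standard library):

```python
from importlib.metadata import version

# Shows the installed release, to compare against the develop branch.
print(version("dcor"))
```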
Yes, I see a similar result to yours: 0.0382084333048913.
BUT, if you do
```python
import pandas as pd
import dcor

# Presumably the dataframe comes from reading the same CSV with pandas
df = pd.read_csv("DEMO1.profiles_data.sample_data.copy.csv")
dcor.distance_correlation(df['ACCOUNTNUMBER'].values, df['CUSTOMERID'].values)
```
you will see 1.3933173158759997
It has something to do with passing a dataframe?!
You are not really passing a dataframe (you are calling `values`, obtaining the internal NumPy array instead). Could you compare the values of `(df['ACCOUNTNUMBER'].values, df['CUSTOMERID'].values)` with the ones in `(a[:, 0], a[:, 1])`? Maybe the dtype differs, or something like that?
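For example, something like this (a sketch that assumes `df` comes from `pd.read_csv` on the same file, with the columns in the same order as in `a`):

```python
# Compare dtypes and values of the two pairs of arrays.
print(a[:, 0].dtype, a[:, 1].dtype)          # np.loadtxt defaults to float64
print(df['ACCOUNTNUMBER'].values.dtype,
      df['CUSTOMERID'].values.dtype)         # pandas likely infers int64 for ID columns
print(np.allclose(a[:, 0], df['ACCOUNTNUMBER'].values),
      np.allclose(a[:, 1], df['CUSTOMERID'].values))  # do the values themselves match?
```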
Found the issue. We get different results depending on the datatype of the array elements. If you add the following line to the code you shared above, you get a dcor greater than 1.0:
```python
a = a.astype(int)
```
Then it may be related to #59. That is fixed (at least partially) in the develop version, and indeed it works on my machine. In any case, it is better to cast the arrays to float, as that is the optimized path.
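For released versions, a sketch of that workaround, casting both arrays to float before calling dcor:

```python
import numpy as np
import dcor

# df as in the earlier snippet; casting to float64 avoids the integer
# code path that produces the wrong result on released versions.
dcor.distance_correlation(
    df['ACCOUNTNUMBER'].values.astype(np.float64),
    df['CUSTOMERID'].values.astype(np.float64),
)
```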
I understand that casting to float is optimized for operations, but the results are far apart! That 0.038 vs 1.39 comes from a mere change in data types is what I don't fully understand.
Then the question is: which result should I trust, 0.038 or 1.39?
0.038 is the right one. The other may be caused by an overflow in the intermediate computations, which does not happen in the current develop version.
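For intuition, a minimal sketch of the kind of silent integer overflow that can corrupt intermediate results (illustrative only, not the exact computation inside dcor):

```python
import numpy as np

# Squaring a large int64 ID exceeds the int64 range and wraps silently.
x = np.array([4_000_000_000], dtype=np.int64)
print(x * x)                      # negative garbage: 1.6e19 wrapped around
print(x.astype(np.float64) ** 2)  # [1.6e+19], computed correctly in float64
```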
Bug description summary
Hi team, thank you for creating and maintaining dcor; I have found it extremely easy to use.
My question is: what is the range of dcor? I see values beyond 1.0.
Code to reproduce the bug
Expected result
The correlation should range between 0 and 1.
Actual result
When I pass my dataframe, I see one of the dcor values as 1.393317.
Traceback (if an exception is raised)
No response
Software versions
dcor==0.6
macOS 12.5.1
Additional context
No response