vnmabus / dcor

Distance correlation and related E-statistics in Python
https://dcor.readthedocs.io
MIT License
144 stars 26 forks

Seemingly incorrect results with `int` datatype #59

Closed eberhard-leander closed 9 months ago

eberhard-leander commented 10 months ago

While experimenting with this package, I encountered a strange issue and thought it would be useful to post about it here. In short, it appears that the `distance_correlation` computation for `int` dtypes is incorrect once the input is sufficiently large.

Here is a minimal example that can be used to replicate the issue:

import numpy as np
from dcor import distance_correlation

def reproduce_error(n):
    # some simple data
    arr1 = np.array([
        1, 2, 3
    ]*n)
    arr2 = np.array([
        10, 20, 5
    ]*n)

    int_int = distance_correlation(
        arr1, arr2
    )
    float_int = distance_correlation(
        arr1.astype(float),
        arr2
    )
    int_float = distance_correlation(
        arr1,
        arr2.astype(float),
    )
    float_float = distance_correlation(
        arr1.astype(float),
        arr2.astype(float),
    )

    print(f"""
    n: {n}
    int vs int: {int_int}
    float vs int: {float_int}
    int vs float: {int_float}
    float vs float: {float_float}
    """)

When we run this code on small samples, the correlations for all dtype combinations agree and do not change substantially with the sample size.

reproduce_error(1)
    n: 1
    int vs int: 0.7621991222319221
    float vs int: 0.7621991222319221
    int vs float: 0.7621991222319221
    float vs float: 0.7621991222319221

reproduce_error(10)
    n: 10
    int vs int: 0.7621991222319219
    float vs int: 0.7621991222319219
    int vs float: 0.7621991222319219
    float vs float: 0.7621991222319217

reproduce_error(100)
    n: 100
    int vs int: 0.7621991222319221
    float vs int: 0.7621991222319221
    int vs float: 0.7621991222319221
    float vs float: 0.7621991222319215

However, past a certain point, the computations diverge:

reproduce_error(10000)
    n: 10000
    int vs int: 0.890284163962155
    float vs int: 0.890284163962155
    int vs float: 0.7621991222319217
    float vs float: 0.7621991222317823

I've started casting everything to float before computing the correlations to avoid this issue.
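A small helper makes that cast explicit. This is only a sketch of the workaround; `as_float_inputs` is an illustrative name, not part of the dcor API:

```python
import numpy as np

def as_float_inputs(x, y):
    """Cast both inputs to float64 before handing them to distance_correlation.

    Illustrative helper, not part of dcor itself.
    """
    return (
        np.asarray(x, dtype=np.float64),
        np.asarray(y, dtype=np.float64),
    )

# usage sketch:
# arr1_f, arr2_f = as_float_inputs(arr1, arr2)
# distance_correlation(arr1_f, arr2_f)
```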

vnmabus commented 10 months ago

It seems that an overflow occurs in the last case, as I obtain the following warning:

/home/carlos/Programas/Utilidades/Lenguajes/miniconda3/envs/fda311/lib/python3.11/site-packages/dcor/_dcor_internals.py:188: RuntimeWarning: overflow encountered in scalar multiply
  third_term = a_total_sum * b_total_sum / n_samples

I will check if it is possible to avoid it when I have a moment.
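The failure mode can be reproduced in isolation. With tens of thousands of samples, the pairwise distance sums reach the billions, and their product in the `a_total_sum * b_total_sum / n_samples` term exceeds the int64 range. The magnitudes below are hypothetical, chosen only to trigger the same wraparound:

```python
import numpy as np

# Hypothetical sums of the same order of magnitude as in the issue:
a_total_sum = np.int64(3_000_000_000)
b_total_sum = np.int64(4_000_000_000)

# True product is 1.2e19, above the int64 maximum (~9.2e18), so the
# multiplication wraps around (NumPy emits a RuntimeWarning for scalars).
product = a_total_sum * b_total_sum
print(product)                              # a negative, wrapped value
print(int(a_total_sum) * int(b_total_sum))  # exact result with Python ints
```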

vnmabus commented 10 months ago

I added a possible fix in #60. Note that converting to floating point arrays is still preferable, because the AVL implementation is compiled in that case.

vnmabus commented 9 months ago

I will merge #60. Note that this can STILL overflow, especially on Windows, where the default integer type (used for integer reductions when the original type is smaller) has only 32 bits. As mentioned before, converting to floating point is preferred.
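The 32-bit case can be reproduced on any platform by forcing the accumulator dtype, which is what happens implicitly when the default integer is 32-bit:

```python
import numpy as np

# True sum is 10**10, well above the int32 maximum (~2.1e9), so a
# 32-bit accumulator wraps around silently.
x = np.full(100_000, 100_000, dtype=np.int32)
total_32 = x.sum(dtype=np.int32)   # wrapped value
total_64 = x.sum(dtype=np.int64)   # correct with a 64-bit accumulator
print(total_32, total_64)
```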