microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

[BUG] Relative entropy should be computed using base 2 logarithm for the Jensen-Shannon distance to be bound between 0 and 1 #2006

Open perezbecker opened 1 year ago

perezbecker commented 1 year ago

SynapseML version

0.11.1

System information

Describe the problem

One of the Distribution Balance Measures you use is the Jensen-Shannon distance, which is defined in terms of the relative entropy on line 238 of DistributionBalanceMeasure.scala. The relative entropy is defined on line 276 of the same file as:

D = SUM(distA * log(distA/distB))

This formula applies only when the entropy is computed in base e (see the scipy docs). But for the Jensen-Shannon distance to be bounded between 0 and 1 (as stated in the documentation), the entropy needs to be computed using the base 2 logarithm (see the Jensen-Shannon distance Wikipedia page). The definition of relative entropy used for the Jensen-Shannon distance should thus be:

D = SUM(distA * log(distA/distB)) / log(base)

Under the current definition, the theoretical maximum Jensen-Shannon distance is sqrt(ln(2)) = 0.83255... < 1.
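
To illustrate the effect of the log(base) normalization outside of SynapseML, here is a minimal NumPy sketch (the helper names below are my own, not SynapseML code) that applies the formula above to two fully disjoint distributions. With base e the distance caps at sqrt(ln(2)), while with base 2 it reaches 1:

import numpy as np

def rel_entropy(p, q, base=np.e):
    # D = SUM(p * log(p / q)) / log(base); terms with p == 0 contribute 0
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) / np.log(base)

def js_distance(p, q, base=np.e):
    # Jensen-Shannon distance: sqrt of the mean relative entropy to the mixture
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = (p + q) / 2
    return np.sqrt((rel_entropy(p, m, base) + rel_entropy(q, m, base)) / 2)

p, q = [1.0, 0.0], [0.0, 1.0]        # maximally drifted distributions
print(js_distance(p, q, base=np.e))  # 0.8325546... = sqrt(ln(2)), the base-e ceiling
print(js_distance(p, q, base=2))     # 1.0, the documented upper bound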

Code to reproduce issue

Here is an example of two extremely drifted distributions. Their Jensen-Shannon Distance has already converged to the theoretical maximum value stated above.

# imports needed for a self-contained repro (module paths assumed per SynapseML 0.11.x / PySpark)
from pyspark.sql.types import StringType
from synapse.ml.exploratory import DistributionBalanceMeasure

imbalanced_color_list = ['red'] * 9999999 + ['blue']
imbalanced_reference_dist = [{'red': 0.0000001, 'blue': 0.9999999}]
df_imbalanced = spark.createDataFrame(imbalanced_color_list, StringType()).toDF("color")

distribution_balance_measure_imb = (
    DistributionBalanceMeasure()
    .setSensitiveCols(['color'])
    .setReferenceDistribution(imbalanced_reference_dist)
    .transform(df_imbalanced).select("FeatureName","DistributionBalanceMeasure.js_dist")
)

distribution_balance_measure_imb.show(truncate=False)
+-----------+-----------------+
|FeatureName|js_dist          |
+-----------+-----------------+
|color      |0.832553583110652|
+-----------+-----------------+

We can reproduce this result using the Jensen-Shannon implementation in SciPy:

from scipy.spatial import distance
import numpy as np
import math

def jensen_shannon_distance_categorical(x_list, y_list, base=2):

    # unique values observed in x and y
    values = set(x_list + y_list)

    x_counts = np.array([x_list.count(value) for value in values])
    y_counts = np.array([y_list.count(value) for value in values])

    x_ratios = x_counts / np.sum(x_counts)  # optional, since jensenshannon normalizes the probability vectors
    y_ratios = y_counts / np.sum(y_counts)

    return distance.jensenshannon(x_ratios, y_ratios, base=base)

imbalanced_source = ['red'] * 9999999 + ['blue']
imbalanced_target = ['red']  + ['blue'] * 9999999

jensen_shannon_distance_categorical(imbalanced_source, imbalanced_target, base=math.e)
0.832553583110652

jensen_shannon_distance_categorical(imbalanced_source, imbalanced_target, base=2)
0.999998765189656

When computing the Jensen-Shannon distance with base e logarithms for our example, the result approaches sqrt(ln(2)) = 0.83255..., while with base 2 logarithms it approaches the desired value of 1.
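
As a sanity check (purely arithmetic, no SynapseML code involved), the two results above differ by exactly the base-conversion factor sqrt(ln(2)), since the base-2 divergence is the base-e divergence divided by ln(2):

import math

print(math.sqrt(math.log(2)))                      # 0.8325546..., the base-e ceiling
print(0.832553583110652 / math.sqrt(math.log(2)))  # 0.9999987..., matching the base-2 result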

Other info / logs

No response

What component(s) does this bug affect?

What language(s) does this bug affect?

What integration(s) does this bug affect?

github-actions[bot] commented 1 year ago

Hey @perezbecker :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.

memoryz commented 1 year ago

@ms-kashyap can you please take a look?