Open perezbecker opened 1 year ago
Hey @perezbecker :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.
@ms-kashyap can you please take a look?
SynapseML version
0.11.1
System information
Describe the problem
One of the Distribution Balance Measures you use is the Jensen-Shannon Distance, defined in terms of the relative entropy in line 238 of DistributionBalanceMeasure.scala. The relative entropy is defined in line 276 of the same file as:
D = SUM(distA * log(distA/distB))
This formula applies only when computing entropy in base e (see scipy doc). But for the Jensen-Shannon Distance to be bounded between 0 and 1 (as stated in the documentation), the entropy needs to be computed using the base 2 logarithm (see Jensen-Shannon Distance wiki page). The definition of entropy used for the Jensen-Shannon distance thus should be:
D = SUM(distA * log(distA/distB)) / log(base)
where log is the natural logarithm and base = 2. Under the current definition the theoretical maximum Jensen-Shannon Distance is sqrt(ln(2)) = 0.83255... < 1.
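To see where this bound comes from, consider the limiting case (an assumption made here for illustration) of two distributions P and Q with disjoint support and mixture M = (P + Q)/2. Then P(x)/M(x) = 2 wherever P(x) > 0, and likewise for Q, so with natural logarithms:

```math
\begin{aligned}
D(P \,\|\, M) &= \sum_x P(x)\,\ln\frac{P(x)}{M(x)} = \sum_x P(x)\,\ln 2 = \ln 2, \\
\mathrm{JSD}(P, Q) &= \tfrac{1}{2}\, D(P \,\|\, M) + \tfrac{1}{2}\, D(Q \,\|\, M) = \ln 2, \\
\sqrt{\mathrm{JSD}(P, Q)} &= \sqrt{\ln 2} \approx 0.83255 .
\end{aligned}
```

Dividing each relative entropy by ln 2 (that is, working in base 2) turns the divergence into 1, and therefore the distance into 1.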
Code to reproduce issue
Here is an example of two extremely drifted distributions. Their Jensen-Shannon Distance has already converged to the theoretical maximum value stated above.
We can reproduce this result using the Jensen-Shannon implementation in Scipy:
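A minimal sketch of that comparison, written in plain Python rather than SciPy so it stays self-contained. The distributions `p` and `q` and the helper names `relative_entropy` and `js_distance` are illustrative assumptions, not the data or code from the original report. With SciPy installed, `scipy.spatial.distance.jensenshannon(p, q, base=2)` should give the same comparison.

```python
import math


def relative_entropy(dist_a, dist_b, base=None):
    """Relative entropy D(A || B); natural log unless `base` is given."""
    # Terms with a == 0 contribute nothing, so they are skipped.
    total = sum(a * math.log(a / b) for a, b in zip(dist_a, dist_b) if a > 0)
    return total if base is None else total / math.log(base)


def js_distance(p, q, base=None):
    """Jensen-Shannon distance: sqrt of the mean relative entropy to the mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = (relative_entropy(p, m, base) + relative_entropy(q, m, base)) / 2
    return math.sqrt(divergence)


# Hypothetical, almost completely drifted distributions (not the issue's data).
p = [0.999999, 0.000001]
q = [0.000001, 0.999999]

dist_e = js_distance(p, q)          # natural log, as in the current definition
dist_2 = js_distance(p, q, base=2)  # base 2, as proposed above
print(dist_e, dist_2)
```

The two results differ only by the constant factor sqrt(ln(2)), which is why switching the base rescales the measure onto the documented [0, 1] range without changing the ordering of any comparisons.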
When computing the Jensen-Shannon distance using base e logarithms for our example, our result approaches sqrt(ln(2)) = 0.83255..., while when using base 2 logarithms, the result approaches the desired value of 1.
Other info / logs
No response
What component(s) does this bug affect?
area/cognitive: Cognitive project
area/core: Core project
area/deep-learning: DeepLearning project
area/lightgbm: Lightgbm project
area/opencv: Opencv project
area/vw: VW project
area/website: Website
area/build: Project build system
area/notebooks: Samples under notebooks folder
area/docker: Docker usage
area/models: models related issue
What language(s) does this bug affect?
language/scala: Scala source code
language/python: Pyspark APIs
language/r: R APIs
language/csharp: .NET APIs
language/new: Proposals for new client languages
What integration(s) does this bug affect?
integrations/synapse: Azure Synapse integrations
integrations/azureml: Azure ML integrations
integrations/databricks: Databricks integrations