Spark Platform: Azure ML Notebooks with Serverless Spark Compute
Describe the problem
When setting a custom reference distribution in DistributionBalanceMeasures, the reported measures are correct as long as every category in the reference distribution is also present in the dataset. However, when the reference (target) distribution contains categories that are not present in the dataset, the measures are incorrect. This is likely to come up in practice, because datasets often lack samples for categories that have a small likelihood of occurring.
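To make the expected behavior concrete, here is a minimal sketch of how observed data and a reference distribution could be aligned over the union of their categories, so that a category missing from the data gets probability zero instead of being dropped. The helper name `align_distributions`, the color labels other than yellow, and the probabilities are illustrative assumptions; this is not SynapseML code.

```python
from collections import Counter


def align_distributions(observed_values, reference):
    """Align observed data and a reference distribution on the same categories.

    `reference` is a dict mapping category -> probability (hypothetical input).
    Categories absent from the data get an observed probability of 0.0.
    """
    counts = Counter(observed_values)
    # Union of categories seen in the data and named in the reference
    categories = sorted(set(counts) | set(reference))
    total = sum(counts.values())
    observed_probs = [counts.get(c, 0) / total for c in categories]
    reference_probs = [reference.get(c, 0.0) for c in categories]
    return categories, observed_probs, reference_probs


# Hypothetical toy data: "yellow" appears in the reference only
observed = ["red"] * 6 + ["blue"] * 4
reference = {"red": 0.5, "blue": 0.3, "yellow": 0.2}

categories, p, q = align_distributions(observed, reference)
print(categories)  # ['blue', 'red', 'yellow']
print(p, q)
```

With both vectors defined over the same three categories, any distance between them accounts for the probability mass the reference assigns to the unseen category.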
Code to reproduce issue
Let me showcase the issue with a toy example.
We will measure the distance between two distributions of a categorical feature using the Jensen-Shannon distance (JSD):
from scipy.spatial import distance
import numpy as np

def jensen_shannon_distance_categorical(x_list, y_list):
    # Unique values observed in x and y
    values = set(x_list + y_list)
    x_counts = np.array([x_list.count(value) for value in values])
    y_counts = np.array([y_list.count(value) for value in values])
    # Normalizing is optional, as scipy's jensenshannon normalizes probability vectors
    x_ratios = x_counts / np.sum(x_counts)
    y_ratios = y_counts / np.sum(y_counts)
    # Warning: We are computing the JSD using base e logarithms for now to compare the result with SynapseML.
    # For the JSD to be bounded between 0 and 1 we need to use base 2 logarithms.
    # See this issue for details: https://github.com/microsoft/SynapseML/issues/2006
    return distance.jensenshannon(x_ratios, y_ratios)

# JSD computed with scipy.
# source and target are the toy lists of category values (their definitions are not shown here).
jensen_shannon_distance_categorical(source, target)
0.1644921288538882
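The remark about logarithm bases in the function above can be checked with a small standalone sketch. It reimplements the Jensen-Shannon distance in pure Python (the function name `js_distance` is an assumption for illustration, not scipy's or SynapseML's API) and compares base e with base 2 on two completely disjoint distributions, which is the maximum-distance case.

```python
import math


def js_distance(p, q, base=math.e):
    """Jensen-Shannon distance: square root of the JS divergence."""
    # Normalize both inputs to probability vectors
    p_total, q_total = sum(p), sum(q)
    p = [v / p_total for v in p]
    q = [v / q_total for v in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence in nats; terms with a == 0 contribute 0
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    # Dividing by log(base) converts from nats to the requested base
    return math.sqrt(max(divergence, 0.0) / math.log(base))


# Two distributions with no overlap at all
disjoint_p = [1.0, 0.0]
disjoint_q = [0.0, 1.0]

max_base_e = js_distance(disjoint_p, disjoint_q)          # sqrt(ln 2), about 0.83
max_base_2 = js_distance(disjoint_p, disjoint_q, base=2)  # 1.0
print(max_base_e, max_base_2)
```

With natural logarithms the distance tops out at sqrt(ln 2), so only the base 2 variant spans the full range from 0 to 1.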
Let's compute the same distance using the setReferenceDistribution method of DistributionBalanceMeasure:
Both answers agree, as expected.
Now we introduce a new category, yellow, into the target distribution; it is not present in the source data. We expect the JSD to be larger than in the original example, as the target distribution is now even more dissimilar from the source. The scipy JSD implementation confirms this intuition:
Now we repeat the calculation with DistributionBalanceMeasure:
This JSD is smaller than the original value, which is incorrect. The same distance results if we truncate the reference distribution to exclude the new category and do not re-normalize the remaining probabilities:
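The truncation effect described above can be illustrated with a standalone sketch. The data here are hypothetical (the original toy lists are not reproduced), and the helper `js_distance_raw` is an assumption written for this illustration; it deliberately skips normalization so that a truncated, non-renormalized reference can be fed in as is. It does not claim to mirror SynapseML's internals.

```python
import math


def js_distance_raw(p, q):
    """Jensen-Shannon distance (base e) WITHOUT normalizing the inputs."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute 0 by convention
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    # Guard against tiny negative values before taking the square root
    return math.sqrt(max(divergence, 0.0))


# Hypothetical category order: red, blue, yellow (yellow is absent from the data)
observed_full = [0.6, 0.4, 0.0]
reference_full = [0.5, 0.3, 0.2]

# Correct comparison: both vectors cover all three categories
correct = js_distance_raw(observed_full, reference_full)

# Suspected faulty comparison: yellow is dropped from the reference,
# and the remaining probabilities (summing to 0.8) are not re-normalized
observed_truncated = [0.6, 0.4]
reference_truncated = [0.5, 0.3]
truncated = js_distance_raw(observed_truncated, reference_truncated)

print(correct, truncated)
```

In this toy setup the truncated value comes out well below the correct one, which matches the pattern reported in this issue: ignoring the unseen category discards exactly the probability mass that makes the two distributions differ most.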
SynapseML version
0.11.1
System information
Other info / logs
No response
What component(s) does this bug affect?
area/cognitive: Cognitive project
area/core: Core project
area/deep-learning: DeepLearning project
area/lightgbm: Lightgbm project
area/opencv: Opencv project
area/vw: VW project
area/website: Website
area/build: Project build system
area/notebooks: Samples under notebooks folder
area/docker: Docker usage
area/models: models related issue
What language(s) does this bug affect?
language/scala: Scala source code
language/python: Pyspark APIs
language/r: R APIs
language/csharp: .NET APIs
language/new: Proposals for new client languages
What integration(s) does this bug affect?
integrations/synapse: Azure Synapse integrations
integrations/azureml: Azure ML integrations
integrations/databricks: Databricks integrations