tensorflow / data-validation

Library for exploring and validating machine learning data
Apache License 2.0

Jensen Shannon implementation #180

Closed: Alpha009 closed this issue 3 years ago

Alpha009 commented 3 years ago

What is taken as the input when computing the Jensen-Shannon divergence? Is it the probabilities for the (numerical) pandas column, or the probability density function of the column?

For example, in this code:

tfdv.get_feature(schema1, 'duration').drift_comparator.jensen_shannon_divergence.threshold = 0.01

What is the duration column converted into before it is used to compute the JS divergence value?

arghyaganguly commented 3 years ago

@Alpha009, thanks for bringing this up. I think the pdf of the 'duration' column is needed before the JS divergence value can be computed. Let me forward this to @caveness.

caveness commented 3 years ago

Sorry for the delay on this. We use the standard histogram and calculate the JSD as shown here:

https://github.com/tensorflow/data-validation/blob/9fbc050580fb2433f7fbd6276bbf5b0e654787d1/tensorflow_data_validation/anomalies/metrics.cc#L266
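To make the histogram-based idea concrete, here is a minimal NumPy sketch. It is not the TFDV code (that lives in the C++ file linked above), and the details are assumptions for illustration: the function name `js_divergence_from_samples` is hypothetical, both columns are binned on shared equal-width edges, the bin count defaults to 10, and base-2 logarithms are used so the result falls in [0, 1]. The linked implementation may differ on any of these points.

```python
import numpy as np


def js_divergence_from_samples(a, b, bins=10):
    """Illustrative Jensen-Shannon divergence between two numeric columns.

    Hypothetical helper, not part of TFDV. Both samples are binned on the
    same equal-width edges, the counts are normalized into probabilities,
    and the JSD is computed with base-2 logs (an assumption).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    if lo == hi:
        # Every value is identical, so the two distributions coincide.
        return 0.0

    # Shared bin edges so that bin i means the same range for both samples.
    edges = np.linspace(lo, hi, bins + 1)
    p_counts, _ = np.histogram(a, bins=edges)
    q_counts, _ = np.histogram(b, bins=edges)

    # Normalize counts into discrete probability distributions.
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()
    m = 0.5 * (p + q)

    def kl(x, y):
        # KL(x || y), skipping bins where x is zero (0 * log 0 := 0).
        mask = x > 0
        return float(np.sum(x[mask] * np.log2(x[mask] / y[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Under these assumptions the input is neither the raw column nor a continuous pdf: each column is first reduced to bin counts, and those normalized counts play the role of the probabilities. Two identical columns give 0, and two columns with no overlapping bins give 1.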

Please feel free to reopen if more information is needed.