milcent / benford_py

Python implementation of Benford's Law tests.
BSD 3-Clause "New" or "Revised" License

Add sample-independent conformity tests - Bhattacharyya distance and Kullback–Leibler divergence #50

Closed by milcent 3 years ago

milcent commented 3 years ago

This issue was born from a discussion with Rosa María Maza Quiroga, a Ph.D. student at the University of Malaga. She asked for the exact p-values when executing the Kolmogorov-Smirnov test, which are not implemented (just the critical values), and ended up giving it up since her samples were really big, and the KS test is known to be best suited to continuous distributions. So she started using the Bhattacharyya distance and Kullback–Leibler divergence instead and suggested we implement them here. She has been kind enough to provide the basis for the code, which I reproduce here so I can find it more easily when I'm implementing them:

import numpy as np

def bhattacharyya_coefficient(distribution_1, distribution_2):
    # Sum of the element-wise geometric means of the two discrete distributions.
    return np.sum(np.sqrt(distribution_1 * distribution_2))

def bhattacharyya_distance(distribution_1, distribution_2):
    # BD = -ln(BC): 0 for identical distributions, growing without bound as they diverge.
    return -np.log(bhattacharyya_coefficient(distribution_1, distribution_2))

def kullbackLeibler_divergence(distribution_1, distribution_2):
    # Sum p * log(p / q), counting only the bins where p is nonzero.
    return np.sum(np.where(distribution_1 != 0,
                           distribution_1 * np.log(distribution_1 / distribution_2),
                           0))
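
A minimal usage sketch for these helpers, comparing a hypothetical observed first-digit distribution (the counts below are made up for illustration only) against the Benford expectation, using the functions defined above:

import numpy as np

# Benford's expected first-digit proportions, P(d) = log10(1 + 1/d) for d = 1..9
benford = np.log10(1 + 1 / np.arange(1, 10))

# A hypothetical observed first-digit distribution (counts normalized to proportions)
counts = np.array([302, 178, 123, 98, 80, 66, 59, 50, 44])
observed = counts / counts.sum()

print(bhattacharyya_coefficient(observed, benford))   # close to 1 for a good fit
print(bhattacharyya_distance(observed, benford))      # close to 0 for a good fit
print(kullbackLeibler_divergence(observed, benford))  # close to 0 for a good fit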
Rosammq commented 3 years ago

Thanks for the introduction, @milcent. It is a pleasure to collaborate with you.

As is known, with huge samples (more than 1,000,000 records), it doesn't make sense to use the chi-square test, because even slight deviations between the compared distributions produce huge chi-square statistics.

Also, the Kolmogorov-Smirnov test is suitable for continuous variables, not for discrete ones.

So the Bhattacharyya coefficient and the Kullback-Leibler divergence are good metrics for comparing a sample distribution with the Benford distribution in all cases (small or big samples).
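
A quick illustration of that point, independent of the package: hold the digit proportions fixed and only scale the sample size N, and the chi-square statistic grows linearly with N, while the KL divergence (which compares the proportions themselves) stays put. The slightly perturbed distribution below is made up for the example.

import numpy as np

benford = np.log10(1 + 1 / np.arange(1, 10))                 # expected first-digit proportions
perturb = 0.005 * np.array([1, -1, 1, -1, 1, -1, 1, -1, 0])  # a slight, sum-preserving deviation
observed = benford + perturb

for n in (1_000, 100_000, 10_000_000):
    chi2 = n * np.sum((observed - benford) ** 2 / benford)   # chi-square scales with N
    kl = np.sum(observed * np.log(observed / benford))       # KL depends only on the proportions
    print(f"N={n:>10,}  chi2={chi2:10.2f}  KL={kl:.6f}")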

We, the researcher community in general, hope that you can find the time to implement this in your package. It will be very beneficial for our work =D

milcent commented 3 years ago

I have implemented them in the newest version of the package, inside every test (F1D, F2D, SD...) of the Benford object. They can be accessed directly as test attributes, or as part of the report() method.

It is not yet on PyPI, so for now you will only be able to install it straight from the repo, with the command below:

python -m pip install git+https://github.com/milcent/benford_py.git@develop

Please try it out on your data and see if it is consistent. I had to change some parts of your KL-divergence function so that numpy wouldn't complain about the zero division inside the log.
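
One way to keep numpy quiet here, shown only as a sketch and not necessarily the exact change made in the package, is to select the nonzero bins before taking the log (np.where evaluates both branches, so the warning fires even for the bins that end up as 0):

import numpy as np

def kullback_leibler_divergence(distribution_1, distribution_2):
    # Restrict to bins where the first distribution is nonzero, so the log never sees 0.
    nonzero = distribution_1 != 0
    return np.sum(distribution_1[nonzero]
                  * np.log(distribution_1[nonzero] / distribution_2[nonzero]))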

Example code below:
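
(The original comment attached a screenshot; the sketch below reconstructs it from the snippet Rosa quotes later in the thread. The product a * b * c of independent random variables just builds a roughly Benford-conforming sample.)

import numpy as np
import benford as bf

a = np.random.randn(3000)
b = np.random.randint(1, 99, 3000)
c = np.random.rand(3000)
abc = a * b * c  # products of independent variables tend toward Benford conformity

bo = bf.Benford(abc)
bo.F1D.report(show_plot=True)

# The new measures, exposed as attributes of each test:
print(bo.F1D.bhattacharyya_distance)
print(bo.F1D.kullback_leibler_divergence)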

milcent commented 3 years ago

I have also implemented both as stand-alone functions:

  • bf.bhattacharyya_distance(data, "F1D", decimals=2) ; and
  • bf.kullback_leibler_divergence(data, 2, decimals=5)

Rosammq commented 3 years ago

You are great! I want to test your new code!!

I'm developing new code (the typical eval.py, test.py, etc., better structured than what I did in a Jupyter Notebook using your package), because I need to publish the code on GitHub for the reviewers of MICCAI 2021. So, if the new version of your package includes Bhattacharyya and Kullback-Leibler, maybe I can include the results of your functions instead of writing my own.

So... it would be fantastic if I could save the Bhattacharyya and Kullback-Leibler values when I run Benford!

In this way, I'll save the Bhattacharyya and Kullback-Leibler values for each image of my dataset to do machine learning on them.

I also think that if I use your package it is better for you, because a lot of people can discover it :)
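
A sketch of that per-image workflow using the stand-alone functions mentioned above; the file layout, the .npy loading step, and the column names are hypothetical placeholders, and the arguments follow the example calls quoted earlier:

import numpy as np
import benford as bf
from pathlib import Path

rows = []
for path in sorted(Path("dataset").glob("*.npy")):  # hypothetical: one array of pixel values per image
    values = np.load(path).ravel()
    rows.append({
        "image": path.name,
        "bhattacharyya": bf.bhattacharyya_distance(values, "F1D", decimals=2),
        "kullback_leibler": bf.kullback_leibler_divergence(values, 2, decimals=5),
    })

# 'rows' can now go into a DataFrame and serve as per-image features for machine learning.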

milcent commented 3 years ago

You should be able to install it via pip now with the implementation, since I published version 0.4.0. Keep me posted. Cheers

Rosammq commented 3 years ago

Hi Marcel!

Maybe I'm doing something wrong...

I created a new conda environment because I want to test your new version.

So, on the terminal I did:

conda create -n newBenford
conda activate newBenford
pip install benford_py
conda list

# packages in environment at /home/rosammq/miniconda3/envs/newBenford:
#
# Name           Version   Build    Channel
_libgcc_mutex    0.1       main
benford-py       0.4.0     pypi_0   pypi
...

It is there, with the correct version (0.4.0)!

So now I open a new Jupyter notebook and copy/paste your code from the email above:

import numpy as np
import benford as bf

a = np.random.randn(3000)
b = np.random.randint(1, 99, 3000)
c = np.random.rand(3000)
abc = a * b * c

bo = bf.Benford(abc)
bo.F1D.report(show_plot=True)

But I have some errors:

bo.F1D.bhattacharyya_distance
AttributeError: 'Test' object has no attribute 'bhattacharyya_distance'

bo.F1D.kullback_leibler_divergence
AttributeError: 'Test' object has no attribute 'kullback_leibler_divergence'

Did you change your code? Or... am I doing something wrong?

If I print bo.F1D, I can see:

             Expected  Counts  Found     Dif        AbsDif    Z_score
First_1_Dig
1            0.301030  857     0.286430  -0.014600  0.014600  1.721017
2            0.176091  555     0.185495   0.009403  0.009403  1.326385
3            0.124939  417     0.139372   0.014433  0.014433  2.359986
...

but not any bhattacharyya_distance or kullback_leibler_divergence values.

I'm really eager to work with your new version!!

Cheers, Rosa :)


milcent commented 3 years ago

You are right! I have mixed up the names inside the class and added "_" to them. I will fix it now. Meanwhile, you will be able to access them by adding "_" at the beginning and end, like so:

  • _bhattacharyya_distance_ ; and
  • _kullback_leibler_divergence_

And the report() method call in each test should work too.
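
For example, with the workaround names (bo being the Benford object from the earlier example):

print(bo.F1D._bhattacharyya_distance_)       # note the extra "_" at both ends
print(bo.F1D._kullback_leibler_divergence_)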

Rosammq commented 3 years ago

Hi Marcel!

But I can't see anything about Bhattacharyya and Kullback-Leibler:

bo.F1D.report(show_plot=False)

###############  First Digit Test  ###############

Mean Absolute Deviation: 0.004534
MAD <= 0.006000: Close conformity.

For confidence level 95%:
Kolmogorov-Smirnov: 0.011414
Critical value: 0.024876 -- PASS

Chi square: 8.171584
Critical value: 15.507000 -- PASS

Critical Z-score: 1.96.
The entries with the significant positive deviations are:

             Expected  Found     Z_score
First_1_Dig
4            0.09691   0.108397  2.09202

I also tried in another fresh environment, installing with python -m pip install @.***

But it doesn't work either.

Best, Rosa :)


milcent commented 3 years ago

OK, I will dig deeper into this. The code is definitely already there, as you can see here: https://github.com/milcent/benford_py/blob/0c59af2c7e9eecddcb15a1a188f670a0ab0afeab/benford/benford.py#L133
The last resort for now is to try reinstalling straight from the repo:

  • pip uninstall benford-py
  • python -m pip install git+ @.***

Rosammq commented 3 years ago

Hi Marcel!

OK, it works fine with the '_'!

It works with python -m pip install git+ @. and also, of course, with python -m pip install git+ @.

Before, Visual Studio Code didn't work because (I think) it had a problem with the interpreter and the kernel.

So... I'm going to test with my dataset to check the results ;)


Rosammq commented 3 years ago

I tested it with my dataset and the results are fine =D

But... what about the Bhattacharyya coefficient? That parameter is more informative than the Bhattacharyya distance, because the coefficient gives some idea of how closely the sample matches Benford's distribution, whereas the Bhattacharyya distance by itself doesn't convey easily interpretable information: 0 <= BC <= 1, while 0 <= BD <= inf.

Could you assess the possibility of adding that parameter, or even replacing the distance with the coefficient?

You can see it in: https://en.wikipedia.org/wiki/Bhattacharyya_distance#:~:text=The%20Bhattacharyya%20coefficient%20is%20an,the%20two%20samples%20being%20considered.&text=are%20the%20numbers%20of%20members,in%20the%20i%2Dth%20partition.

milcent commented 3 years ago

Sure! It is essentially already implemented, since the distance is computed from the coefficient, right? But the coefficient is not exposed as a class attribute. It should be easy now. I will let you know.
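
For reference, since the distance is just the negative log of the coefficient (see the helper functions at the top of this thread), the coefficient can already be recovered from the reported distance; a small sketch using the stand-alone function quoted earlier, with a made-up sample standing in for real data:

import numpy as np
import benford as bf

data = np.random.rand(3000) * np.random.randint(1, 99, 3000)  # any sample to be tested
bd = bf.bhattacharyya_distance(data, "F1D", decimals=2)        # Bhattacharyya distance, 0 <= BD < inf
bc = np.exp(-bd)                                               # Bhattacharyya coefficient, 0 <= BC <= 1
print(bd, bc)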


Rosammq commented 3 years ago

Yes!

It sounds great! Let me know the news.

Thanks Marcel.

milcent commented 3 years ago

Done. You should be able to use it in the latest release (0.4.1), which is already installable via pip. Best wishes
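
A quick check of the new release might look like the lines below; note that the exact attribute name for the newly exposed coefficient is an assumption based on the naming of the existing distance attribute, so adjust it if the release uses a different one:

# python -m pip install --upgrade benford-py   (pulls 0.4.1 or later)
import numpy as np
import benford as bf

bo = bf.Benford(np.random.randn(3000) * np.random.randint(1, 99, 3000) * np.random.rand(3000))
print(bo.F1D.bhattacharyya_distance)
print(bo.F1D.bhattacharyya_coefficient)  # assumed name for the new coefficient attribute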
