High Z score and its interpretation

vishnuu commented 5 years ago

I am new to Benford's law. I tested whether the third column of data-0.txt obeys Benford's law and I got very large Z scores, like 26, 86, and 226. Is that normal, or am I doing something wrong? If it is normal, does it mean the column does not obey Benford's law?

Any help would be greatly appreciated.

milcent commented 5 years ago

Dear Vishnuu, thanks for using Benford_py. While examining your issue, I even discovered a bug in the code, though maybe not in the way you expected. Let me explain:

The series of records you are evaluating is made of integers, so in the function call you must set the decimals parameter to 0. But when I did this, the internal check for numerical data reported that it was neither int nor float, because it checks for the name "int", while the dtype in pandas is "int64". I shall fix it, then.
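The dtype mismatch can be reproduced with plain pandas. What follows is only a sketch of the idea, not the package's actual check: the helper names `naive_is_numeric` and `robust_is_numeric` are hypothetical, and `pd.api.types.is_numeric_dtype` is just one way such a check could be written.

```python
import pandas as pd

def naive_is_numeric(series: pd.Series) -> bool:
    # Hypothetical check that compares the dtype name against exact strings;
    # it misses names such as "int64" or "float64".
    return series.dtype.name in ("int", "float")

def robust_is_numeric(series: pd.Series) -> bool:
    # pandas' own helper accepts int64, int32, float64 and other numeric dtypes.
    return pd.api.types.is_numeric_dtype(series)

values = pd.Series([12, 7, 305, 48], dtype="int64")

print(values.dtype.name)
print(naive_is_numeric(values))
print(robust_is_numeric(values))
```

Until a fix is released, converting the column with `.astype(float)` sidesteps the name comparison, which is the workaround used in this thread.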

As for your problem, I could run it by converting the column to floats and setting decimals to 0, as shown in the image below.

[Image: FireShot Capture 093 - Untitled - localhost]

There is no use in applying other tests, such as the first two digits, because the sample is too small, or the second digit, since there are many single-digit entries.

Give it a try this way and let me know if it still doesn't work. Best wishes, Milcent

vishnuu commented 5 years ago

Dear Milcent,

Thank you very much for your help.
But I get an error with this line of code:

    f1d = bf.first_digits(data.iloc[:,2].astype(float), digs=1, decimals=0, confidence=99, high_Z='all')

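I tried the following:

    import numpy as np
    import pandas as pd

    data = pd.read_csv("C:/Users/data-01",
                       sep=r"\s+",  # separator: whitespace
                       header=None)
    data = pd.DataFrame(data)
    # log of the ratio between each value and the previous one
    data[2] = np.log(data[2] / data[2].shift())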

For this line of code I got the following result:

    f1d = bf.first_digits(data[2], digs=1, decimals=8, confidence=95)

                 Expected     Found    Z_score
    First_1_Dig
    5            0.079181  0.320175  26.891600
    6            0.066947  0.245614  21.522345
    2            0.176091  0.217105   3.208312

But for this line of code I got an error:

    f1d = bf.first_digits(data.iloc[:,2].astype(float), digs=1, decimals=0, confidence=99, high_Z='all')

    Initialized sequence with 912 registries.
    Traceback (most recent call last):
      File "benford_gas.py", line 66, in <module>
        f1d = bf.first_digits(data.iloc[:,2].astype(float), digs=1, decimals=0, confidence=99, high_Z='all' )
      File "C:\Users\benford.py", line 1141, in first_digits
        ret_df=True)
      File "C:\Users\benford.py", line 311, in first_digits
        confidence=confidence)
      File "C:\Users\benford.py", line 1069, in prep
        dd['Z_score'] = _Z_score(dd, N)
      File "C:\Users\benford.py", line 761, in _Z_score
        return (frame.AbsDif - (1 / (2 * N))) / np.sqrt(
    ZeroDivisionError: division by zero

I also tried another dataset from the link below, where the columns are decimals, but I still got very high Z scores. I took the first sample in the zip file, checked the fifth column, and got the following result.
https://archive.ics.uci.edu/ml/machine-learning-databases/00487/

                 Expected     Found     Z_score
    First_1_Dig
    4            0.096910  0.190353  118.309580
    9            0.045757  0.059377   24.406925
    1            0.301030  0.307908    5.613896
    3            0.124939  0.128909    4.493977

Can you please have a look at this?

Many Thanks, Vishnu

milcent commented 5 years ago

I got the same results as yours in the log-transformed df[2] column, and much worse (higher Z) results in the gas dataset. I didn't get the division-by-zero error, though.

I think the problem with both datasets is that the values don't span many orders of magnitude, as confirmed by their ranges (df.describe()). The gas dataset is also pretty large, so the Z-test gets very sensitive (a power problem). Very high Z-scores in very large samples may be counterbalanced by setting the limit_N parameter to a smaller number, like 2,500, and/or setting the MAD parameter to True, which makes it compute the Mean Absolute Deviation, a measure more robust to large samples. I tried that, but it still showed no conformity. I also tried the First Two Digits test (digs=2), since the sample was large, also to no avail.

`The Mean Absolute Deviation is 0.1613549019904375
For the First Digit:

0.0000 to 0.006: Close Conformity
0.006 to 0.012: Acceptable Conformity
0.012 to 0.015: Marginally Acceptable Conformity
Above 0.015: Nonconformity`

Maybe these datasets are not Benford-compliant after all. To gain more confidence in the testing itself, I suggest you get some datasets that are well known to be compliant, like some of those linked here. The populations of countries' cities are usually very adherent, as is spending data.
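To make the sample-size effect concrete, here is a minimal sketch of the two statistics, written from the textbook formulas rather than copied from the package. The function names `benford_expected`, `z_scores` and `mad` are hypothetical, and the "found" proportions are made-up numbers used only for illustration. The Z expression follows the fragment visible in the traceback earlier in this thread; the denominator is an assumption based on the usual standard error of a proportion.

```python
import numpy as np

def benford_expected() -> np.ndarray:
    # Expected first-digit proportions: log10(1 + 1/d) for d = 1..9.
    digits = np.arange(1, 10)
    return np.log10(1 + 1 / digits)

def z_scores(found: np.ndarray, expected: np.ndarray, n: int) -> np.ndarray:
    # One Z statistic per digit, with the 1/(2n) continuity correction.
    if n <= 0:
        # With no valid records, 1 / (2 * n) would divide by zero.
        raise ValueError("n must be positive")
    abs_dif = np.abs(found - expected)
    return (abs_dif - 1 / (2 * n)) / np.sqrt(expected * (1 - expected) / n)

def mad(found: np.ndarray, expected: np.ndarray) -> float:
    # Mean Absolute Deviation: average absolute gap, independent of n.
    return float(np.mean(np.abs(found - expected)))

expected = benford_expected()

# Illustrative (made-up) deviation: move 2 percentage points from digit 1 to digit 9.
found = expected.copy()
found[0] -= 0.02
found[8] += 0.02

for n in (1_000, 100_000):
    largest_z = z_scores(found, expected, n).max()
    print(f"n={n}: largest Z={largest_z:.2f}, MAD={mad(found, expected):.5f}")
```

With the same proportions, the largest Z score grows roughly tenfold when n grows a hundredfold, while the MAD stays at 0.04 / 9, or about 0.0044, whatever the sample size. That is why the MAD is the more useful measure for very large samples.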

vishnuu commented 5 years ago

I tested again with a different dataset (the Intel Sensor dataset, http://db.csail.mit.edu/labdata/labdata.html), but it showed no conformity either. In general, I want to know whether sensor measurement data conforms to Benford's law; that is why I am trying sensor data. Now I have a doubt: do only population and transaction data obey Benford's law? Does sensor data fall into the category of non-conforming data?

milcent commented 5 years ago

I'm sorry, I can't tell you for sure whether the data you are interested in conforms to Benford's law. When there is no certainty about it, we just have to test several samples and find out. What I can say, though, is that it is not about the sensor, but about what the sensor is measuring. If it is temperature or humidity, the small span of values will probably cause them not to conform. Measurements of light and voltage, however, could be good candidates, since they might be as low as zero and as high as hundreds of thousands. And this is not the only feature of Benford-compliant datasets. Check this reference for more.
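A quick way to see the effect of span is to compare two simulated series. This is only a sketch under stated assumptions: both series are randomly generated stand-ins (not real sensor data), and the helper names `nonzero_magnitudes`, `orders_of_magnitude` and `first_digit_mad` are hypothetical.

```python
import numpy as np

# Expected first-digit proportions under Benford's law: log10(1 + 1/d).
BENFORD = np.log10(1 + 1 / np.arange(1, 10))

def nonzero_magnitudes(values) -> np.ndarray:
    # Absolute values with zeros removed (zero has no leading digit).
    magnitudes = np.abs(np.asarray(values, dtype=float))
    return magnitudes[magnitudes > 0]

def orders_of_magnitude(values) -> float:
    # log10 of the ratio between the largest and smallest magnitudes.
    magnitudes = nonzero_magnitudes(values)
    return float(np.log10(magnitudes.max() / magnitudes.min()))

def first_digit_mad(values) -> float:
    # Scale each value into [1, 10), truncate to get its first digit,
    # then average the absolute gaps to the Benford proportions.
    magnitudes = nonzero_magnitudes(values)
    exponents = np.floor(np.log10(magnitudes))
    digits = np.floor(magnitudes / 10.0 ** exponents).astype(int)
    found = np.bincount(digits, minlength=10)[1:10] / digits.size
    return float(np.mean(np.abs(found - BENFORD)))

rng = np.random.default_rng(0)
# Simulated narrow-span series, loosely like room temperatures in Celsius.
narrow = rng.uniform(15.0, 30.0, size=10_000)
# Simulated wide-span series covering many orders of magnitude.
wide = rng.lognormal(mean=5.0, sigma=3.0, size=10_000)

for name, series in (("narrow", narrow), ("wide", wide)):
    print(name, orders_of_magnitude(series), first_digit_mad(series))
```

The narrow series covers less than one order of magnitude and only ever starts with the digits 1 or 2, so its MAD lands far above the 0.015 nonconformity threshold. The wide series spans several orders of magnitude and comes out much closer to the expected proportions.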
I wish I could have been more helpful...

vishnuu commented 5 years ago

Thank you so much for the explanation. I will read further about the law.

milcent commented 5 years ago

OK. I'll close this here then, but feel free to jump in should you need more help. Cheers

milcent commented 4 years ago

Dear @vishnuu, I was thinking about your problem the other day when releasing the latest version of the package and updating the Demo notebook. Maybe your sensor data is not Benford-compliant (like the SPY closing prices in the notebook), but its variations may be (like the SPY returns). So, if you have lots of sensor readings, you could test the variations (percentage or log) across time. Let me know if it helps. Cheers, Milcent
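As a rough sketch of that suggestion, the variations can be derived with pandas before running any digit test. The readings below are simulated (a slowly drifting level, not real sensor data), and `leading_digit` is a hypothetical helper, not part of the package.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Simulated stand-in for a sensor series: a level near 20 that drifts slowly.
readings = pd.Series(20.0 + np.cumsum(rng.normal(0.0, 0.05, size=5_000)))

# Variation between consecutive readings, as a percentage and as a log ratio.
pct_variation = (readings / readings.shift() - 1).dropna()
log_variation = np.log(readings / readings.shift()).dropna()

def leading_digit(values: pd.Series) -> pd.Series:
    # First significant digit of each non-zero value, ignoring the sign.
    magnitudes = values.abs()
    magnitudes = magnitudes[magnitudes > 0]
    exponents = np.floor(np.log10(magnitudes))
    return np.floor(magnitudes / 10.0 ** exponents).astype(int)

# Observed first-digit proportions of the log variations.
print(leading_digit(log_variation).value_counts(normalize=True).sort_index())
```

Either variation series could then be passed to `bf.first_digits(..., digs=1, decimals=8)`, as in the call used earlier in this thread. Whether the variations actually conform is something to test on the real readings, not something this sketch demonstrates.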

milcent / benford_py

High Z score and its interpretation #20