Closed vishnuu closed 4 years ago
Dear Vishnuu, thanks for using Benford_py. In examining your issue, I even discovered a Bug in the code, but maybe not in the way you expected. Let me explain:
The series of records you are evaluating is made of integers, so in the function call you must set the decimals parameter to 0. But when I did this, the internal check for numerical data reported that it was neither int nor float, because it checks for the name "int", while the dtype in pandas is "int64". I shall fix it, then.
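To illustrate the kind of check that trips here, a minimal sketch follows. It is not the package's actual code or its eventual fix; `is_numeric_series` is a hypothetical helper, and it relies only on the `pandas.api.types` functions `is_integer_dtype` and `is_float_dtype`.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def is_numeric_series(values: pd.Series) -> bool:
    """Hypothetical helper: True for any integer or float dtype."""
    return is_integer_dtype(values.dtype) or is_float_dtype(values.dtype)


ints = pd.Series([12, 7, 305])

# Comparing the dtype name with the literal "int" misses names like "int64"
print(ints.dtype.name == "int")

# Checking the dtype family accepts int64, int32, float64, and so on
print(is_numeric_series(ints))
print(is_numeric_series(ints.astype(float)))
```

Checking the dtype family rather than its name avoids depending on the platform's default integer width.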
Now to your problem: I could run it by converting the column to floats and setting decimals to 0, as shown in the image below.
There is no use applying other tests, such as the first two digits, because the sample is too small, or the second digit, since there are a lot of single-digit entries.
Give it a try this way and let me know if it still doesn't work. Best wishes, Milcent
Dear Milcent,
Thank you very much for your help.
But I get an error with this line of code:
f1d = bf.first_digits(data.iloc[:,2].astype(float), digs=1, decimals=0, confidence=99, high_Z='all' )
I tried the following
data = pd.read_csv("C:/Users/data-01", sep=r"\s+", header=None)  # separator: whitespace
data=pd.DataFrame(data)
data[2]= np.log(data[2]/data[2].shift())
For this line of code I got the following result:
f1d = bf.first_digits(data[2], digs=1, decimals=8, confidence=95)
First_1_Dig  Expected  Found     Z_score
5            0.079181  0.320175  26.891600
6            0.066947  0.245614  21.522345
2            0.176091  0.217105   3.208312
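As a cross-check on how numbers like these arise, here is a minimal sketch. The Expected column matches log10(1 + 1/d). The Z formula is an assumption (a Z-test for a proportion with a 1/(2N) continuity correction), not taken from the package source, but with N = 912 it reproduces the digit 5 row. The count 292 is inferred from Found = 0.320175, and `benford_expected` and `z_score` are illustrative names.

```python
import math


def benford_expected(digit: int) -> float:
    """Benford's expected proportion for a first digit d: log10(1 + 1/d)."""
    return math.log10(1 + 1 / digit)


def z_score(found: float, expected: float, n: int) -> float:
    # Assumed formula: proportion Z-test with a 1/(2N) continuity correction
    standard_error = math.sqrt(expected * (1 - expected) / n)
    return (abs(found - expected) - 1 / (2 * n)) / standard_error


n = 912               # sample size, taken from "Initialized sequence with 912 registries"
found_5 = 292 / n     # inferred count; 292 / 912 = 0.320175 as in the table
z5 = z_score(found_5, benford_expected(5), n)

print(round(benford_expected(5), 6), round(z5, 2))
```

A digit 5 that shows up about four times as often as Benford expects is what drives the Z-score of roughly 26.9.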
But for this line of code I got an error:
f1d = bf.first_digits(data.iloc[:,2].astype(float), digs=1, decimals=0, confidence=99, high_Z='all' )
Initialized sequence with 912 registries.
Traceback (most recent call last):
File "benford_gas.py", line 66, in
I also tried another dataset, from the link below, where the columns are decimals, but I still got very high Z-scores. I took the first sample in the zip file, checked the fifth column, and got the following result. https://archive.ics.uci.edu/ml/machine-learning-databases/00487/
First_1_Dig  Expected  Found     Z_score
4            0.096910  0.190353  118.309580
9            0.045757  0.059377   24.406925
1            0.301030  0.307908    5.613896
3            0.124939  0.128909    4.493977
Can you please have a look at this?
Many Thanks, Vishnu
I got the same results as yours in the log-transformed df[2] column, and much worse (high Z) results in the gas dataset. I didn't get the divide-by-zero error, though.
I think the problem with both datasets is that the values don't span many orders of magnitude, as confirmed by their ranges (df.describe()).
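One rough way to quantify that span is the base-10 logarithm of the ratio between the largest and smallest nonzero magnitudes. The sketch below is only an illustration: `orders_of_magnitude` is a hypothetical helper and both series are made-up values, not taken from either dataset.

```python
import numpy as np
import pandas as pd


def orders_of_magnitude(values: pd.Series) -> float:
    """log10 of the ratio between the largest and smallest nonzero magnitudes."""
    magnitudes = values.abs()
    magnitudes = magnitudes[magnitudes > 0]  # drops zeros (and NaN) before taking the ratio
    return float(np.log10(magnitudes.max() / magnitudes.min()))


narrow = pd.Series([18.5, 21.0, 24.3, 29.9])     # made-up, temperature-like readings
wide = pd.Series([0.4, 12.0, 950.0, 48000.0])    # made-up values across several decades

print(orders_of_magnitude(narrow))  # well below 1
print(orders_of_magnitude(wide))    # about 5
```

A column whose span stays well under one order of magnitude cannot populate all nine leading digits in Benford proportions.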
And the gas dataset is also pretty large, so the Z-test gets very sensitive (power problem). Very high Z-scores in very large samples may be counterbalanced by setting the limit_N parameter to a smaller number, like 2,500, and/or setting the MAD parameter to True, which will cause it to compute the Mean Absolute Deviation, a measure more robust to large samples. I tried that, but it still showed no conformity.
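The sensitivity to sample size can be sketched numerically. The block below is not the package's implementation. It assumes the continuity-corrected Z-statistic for proportions and takes the Mean Absolute Deviation as the mean of the absolute differences between found and expected proportions. The found proportions are hypothetical: a fixed shift of 0.01 on digits 1 and 9.

```python
import numpy as np

digits = np.arange(1, 10)
expected = np.log10(1 + 1 / digits)

# Hypothetical found proportions: a small, fixed departure from Benford
found = expected.copy()
found[0] -= 0.01   # digit 1 slightly under-represented
found[8] += 0.01   # digit 9 slightly over-represented


def z_scores(found, expected, n):
    # Assumed formula: proportion Z-test with a 1/(2N) continuity correction
    return (np.abs(found - expected) - 1 / (2 * n)) / np.sqrt(expected * (1 - expected) / n)


def mean_absolute_deviation(found, expected):
    # Average absolute gap between found and expected proportions; N does not enter
    return float(np.mean(np.abs(found - expected)))


for n in (2_500, 250_000):
    largest_z = float(z_scores(found, expected, n).max())
    print(n, round(largest_z, 2), round(mean_absolute_deviation(found, expected), 5))
```

With the same proportions, the largest Z grows roughly with the square root of N, while the MAD stays the same, which is why capping N or reading the MAD helps with very large samples.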
I also tried the First Two Digits test (digs=2), since the sample was large, also to no avail.
The Mean Absolute Deviation is 0.1613549019904375
For the First Digit:
I tested again with a different dataset (the Intel Sensor dataset): http://db.csail.mit.edu/labdata/labdata.html But it showed no conformity either. In general, I want to know whether sensor measurement data conforms to Benford's law; that's why I am trying sensor data. Now I have a question: do only population and transaction data obey Benford? Does sensor data fall into the category of non-conforming data?
I'm sorry, I can't tell you for sure whether the data you are interested in conforms to Benford or not.
When there is no certainty about it, we just have to test with several samples and find out.
What I can say, though, is it is not about the sensor, but about what the sensor is measuring.
If it is temperature, or humidity, their small span will probably cause them not to conform.
The measurements of light and voltage, however, could be good candidates, since they might be as low as zero and as high as hundreds of thousands. And this is not the only feature of Benford-compliant datasets. Check this reference for more.
I wish I could have been more helpful...
Thank you so much for the explanation. I will read further about the law.
OK. I'll close here then, but feel free to jump in should you need more help. Cheers
Dear @vishnuu, I was thinking about your problem the other day when releasing the latest version of the package and updating the Demo notebook. Maybe your sensor data is not Benford-compliant (like the SPY closing prices on the notebook), but their variations may be (like the SPY returns). So, if you have lots of sensor readings, you could test the variations (percentage or log) across time. Let me know if it helps. Cheers, Milcent
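A minimal sketch of that idea, using pandas and NumPy only: `first_digit` is a hypothetical helper, the readings are made-up values, and the log change mirrors the `np.log(data[2]/data[2].shift())` transformation used earlier in this thread.

```python
import numpy as np
import pandas as pd


def first_digit(values: pd.Series) -> pd.Series:
    """Leading nonzero digit of each nonzero, finite value."""
    magnitudes = values.abs()
    magnitudes = magnitudes[(magnitudes > 0) & np.isfinite(magnitudes)]
    exponents = np.floor(np.log10(magnitudes))
    return np.floor(magnitudes / 10.0 ** exponents).astype(int)


readings = pd.Series([20.1, 20.4, 19.8, 21.5, 21.4, 23.0])  # made-up sensor values

# Variations across time: log changes and percentage changes
log_changes = np.log(readings / readings.shift()).dropna()
pct_changes = readings.pct_change().dropna()

print(first_digit(readings).tolist())      # raw readings share almost the same first digit
print(first_digit(log_changes).tolist())   # the changes spread over more digits
print(first_digit(pct_changes).tolist())
```

Either series of changes can then be passed to the first-digit test in place of the raw readings. Zero changes (repeated readings) have no leading digit, so the helper drops them.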
data-0.txt
I am new to Benford's law. I tested whether the third column of data-0.txt obeys Benford's law and I got very large Z-scores, like 26, 86, and 226. Is that normal, or am I doing something wrong? If it is normal, does it mean the column does not obey Benford's law?
Any help would be very great.