From your question, I assume you are using the first_digits test (please confirm). This happens because the data probably has entries that, after being internally multiplied by 10 ** decimals, are still less than 1, so their first digit cannot be read and the function discards them. You can easily fix this by increasing the decimals parameter, whose default is 2. Just be advised that this may not be suitable for every dataset: basically, it is a trade-off between discarding a (hopefully small) part of the sample and keeping records that may distort the results. If you check the demo notebook, I study daily SPY returns, which can be really small (0.000935), so I set decimals to 8 in order to bring enough digits to the left of the decimal point before checking their positions.
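For illustration, here is a minimal sketch of that preparation step using plain pandas, followed by a first_digits call with a larger decimals value. The sample values are made up, and the call assumes the signature shown in the benford_py README (parameter names may differ slightly between versions):

```python
import pandas as pd
import benford as bf  # pip install benford_py

# Hypothetical sample containing some very small values (like daily returns).
data = pd.Series([0.000935, 0.0042, 12.5, 0.003, 150.0, 0.0007])

decimals = 2                           # the default
prepared = data * 10 ** decimals       # shift the decimal point
print((prepared < 1).sum(), "records would be discarded (< 1 after preparation)")

# Raising decimals brings more digits to the left of the decimal point,
# so fewer records fall below 1 and get dropped.
f1d = bf.first_digits(data, digs=1, decimals=8, confidence=95)
```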
@milcent Thank you very much for your reply. Yes, I am using the first_digits test. Your work can greatly help me with an analysis of whether subsidy amount data is genuine. The problem I am having now is that a lot of data is discarded during the run. I don't know if it is because the range of values I need to check is particularly large: from tens to hundreds of thousands, with the number of records close to one million. Please help me. I am very sorry for the late reply; we are not in the same time zone.
I am glad this lib may be helpful to you. Knowing nothing about your dataset, I would not worry about 20 records being discarded among thousands. The way I see it, you have a pretty large sample. You may, however, run into another issue, which is the power problem: the test may raise false positive flags when the sample is too large. You may need to raise the confidence level (say, to 99, 99.9, etc.) so as to make it less sensitive. I am in Brazil, UTC-3. If your data's characteristics prevent you from talking openly about it here, reach me by email: marcelmilcent@gmail.com
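As a sketch (assuming the benford_py signature from its README, where `data` is your Series of amounts; `confidence` and `limit_N` may be named differently in older releases):

```python
import benford as bf

# With ~880,000 records, even tiny deviations from Benford become statistically
# "significant" (the excess-power problem), so a stricter confidence level
# makes the Z-test flags less sensitive.
f1d = bf.first_digits(data, digs=1, decimals=8, confidence=99.9)

# Some versions also expose limit_N, which caps the N used in the Z statistic,
# as another way to tame the power problem.
# f1d = bf.first_digits(data, digs=1, decimals=8, confidence=95, limit_N=2500)
```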
Thank you again for your attention to my question, and thank you for your reply. I would especially like you to help me check it, but the data sits on an isolated physical machine and cannot be exported, so I can't send it to you. During the run, more than 40,000 records were discarded out of a total of about 880,000.
Did you try increasing the decimals parameter? Try increasing it incrementally (3, 4, 5...) and check if the number of discarded records decreases.
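A quick way to do that check without re-running the whole test each time (a plain numpy sketch, where `data` stands for your amount column):

```python
import numpy as np

values = np.abs(data.to_numpy())  # your amounts as a numpy array
for dec in range(2, 9):
    # Count the records that would still be < 1 after the 10 ** dec shift,
    # i.e. the ones the test would discard.
    discarded = int((values * 10 ** dec < 1).sum())
    print(f"decimals={dec}: {discarded} records would be discarded")
```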
Thank you, I set the parameter to 8. I also divide all the data by 1000, and then very little data is discarded when running. But I am not sure if this is appropriate. Do you think this is okay?
If you are setting decimals to 8 and dividing the data by 1000, it is the same as just setting decimals to 5, since what decimals does is set the factor all the data will be multiplied by (10 ** decimals). It is a preparation step to shift the decimal point, so it is easier for the internals to check the positions of the digits, according to the test used. If you are using first_digits, this should be no problem. But your dataset is really large (> 800,000 records), so you might want to consider using the first-two or even the first-three digits test, or breaking your data down into chunks, according to time periods etc. I just remembered that you can also set decimals to 'infer', and it will work it out for every record, but this will decrease performance. Anyway, your issue made me consider trying to find a way around all this...
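To make the arithmetic concrete, and to sketch the two alternatives mentioned above (the first-two-digits test and chunking by period), here is a hedged example. The DataFrame, its `period` and `amount` columns, and the sample values are hypothetical, and the calls assume the benford_py signatures from its README:

```python
import numpy as np
import pandas as pd
import benford as bf

# Dividing by 1000 and then shifting by 10 ** 8 is the same overall shift
# as multiplying the raw data by 10 ** 5, since 10**8 / 10**3 == 10**5.
x = np.array([0.000935, 12.5, 150.0])
assert np.allclose((x / 1000) * 10 ** 8, x * 10 ** 5)

# Hypothetical frame standing in for the real subsidy data.
df = pd.DataFrame({
    'period': ['2018Q1', '2018Q1', '2018Q2', '2018Q2'],
    'amount': [35.2, 480.0, 12750.0, 98000.0],
})

# First-two-digits test on the whole column (digs=2), with a strict
# confidence level because of the large sample.
f2d = bf.first_digits(df['amount'], digs=2, decimals=5, confidence=99.9)

# Or run the test separately on each time-period chunk.
for period, chunk in df.groupby('period'):
    bf.first_digits(chunk['amount'], digs=1, decimals=5, confidence=99.9)
```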
Thank you very much! You helped me a lot. ♥️
No problem. I will close this issue now. Should you need anything else, reach me by email. Cheers
@milcent Could you please tell me what the "discarded 20 records <1 after preparation" message means? Do I need to fix something?