Open ramongonze opened 5 months ago
Regardless of what we use as the labels, we need to compute which bin each group belongs to. This is currently done by rounding, which causes the problem. Replacing round
with math.ceil
solves the problem partially. It helps with the rounding error, but there still might be floating-point errors. For example, say the actual value is 0.05, meaning that it should be placed in the bin (4, 5], but due to floating-point errors we get 0.0500000000001, Then, it is going to be rounded up to 0.06, and thus it is going to be placed in the bin (5, 6]. We can solve this, e.g., by using the Decimal type instead, but (probably) at the cost of performance. Nevertheless, perhaps this is not a real concern, as it should not occur very often, and the histogram is just a visual representation to try and understand how the data behaves.
Regarding the labels, @nunesgh I think you can use tuples for the intervals, and (this may depend on the Pandas version, but I'm using an old one, 1.5.3) you don't need to convert the dictionary to string:
>>> df = pd.DataFrame({"histogram": [ { (1, 2): 0.8, (2, 3): 0.2 }, { (1, 2): 0.3, (2, 3): 0.7 } ]})
>>> df.columns
Index(['histogram'], dtype='object')
>>> df
histogram
0 {(1, 2): 0.8, (2, 3): 0.2}
1 {(1, 2): 0.3, (2, 3): 0.7}
>>> df.loc[0, "histogram"]
{(1, 2): 0.8, (2, 3): 0.2}
>>> type(df.loc[0, "histogram"])
<class 'dict'>
I still don't understand why it's necessary to round (or ceiling) the probabilities instead of only using the original value itself and place it in the proper bin.
place it in the proper bin.
And how would we do that without computing the index of the bin? We could traverse all the bins every time and compare the probability against the intervals, but it's worst in terms of performance and it doesn't solve the floating-point error.
If there are 100 bins, where bin 1 is (0,1]
, bin 2 (1,2]
, and so on, for a given probability $p$ the index will be int(p*100)
. More precisely, max(1,int(p*100))
, to put probabilities $p < 0.01$ in the first bin.
If there are 100 bins, where bin 1 is
(0,1]
, bin 2(1,2]
, and so on, for a given probability p the index will beint(p*100)
. More precisely,max(1,int(p*100))
, to put probabilities p<0.01 in the first bin.
You're still rounding, but down instead (or, more precisely, truncating). And, it doesn't work the way we want. Say $p = 1.5\%$. Then, max(1, int(1.5/100 * 100)) = max(1, int(1.5)) = max(1, 1) = 1
, which in your approach corresponds to bin (0, 1]
, but 1.5 should be placed in the bin 2 (1, 2]
.
I'm sorry, you're right. It seems that using ceiling is more appropriate. So the bin index would be ceil(p*100)
, where $p\in[0,1]$, right?
So the bin index would be
ceil(p*100)
, where p∈[0,1], right?
That's right! Take a look at the changes in #5.
Review the latest update and respond here: New Comment
This is critical—take action now.
Stay secure!
You can check out the latest update and respond directly here: New Comment. Click the link to access!
Problem: The histogram provided by the package (for all attacks) is rounding probabilities with <0.5% to 0%, but for any attack the probability can never be 0. The probabilities should not be rounded.
Improvement: Each probability is being rounding before generating the histogram. The histogram could be given without any rounding in the following way:
As the histogram is already a dictionary, it could be given with the interval of bins as the labels. For example: