Histograms: Fix rounding and improve format

ramongonze commented 5 months ago

Problem: The histogram provided by the package (for all attacks) is rounding probabilities with <0.5% to 0%, but for any attack the probability can never be 0. The probabilities should not be rounded.

Improvement: Each probability is being rounding before generating the histogram. The histogram could be given without any rounding in the following way:

Set a histogram with 100 bins: (0,1], (1,2], ..., (99,100]
Each probability could be placed in the corresponding bin. For instance, 0.4375 would be placed in the 43th bin, (43,44].

As the histogram is already a dictionary, it could be given with the interval of bins as the labels. For example:

{
"(0,1]": probability_1,
"(1,2]": probability_2,
...
"(99,100]": probability_100,
}

luigidcsoares commented 5 months ago

Regardless of what we use as the labels, we need to compute which bin each group belongs to. This is currently done by rounding, which causes the problem. Replacing round with math.ceil solves the problem partially. It helps with the rounding error, but there still might be floating-point errors. For example, say the actual value is 0.05, meaning that it should be placed in the bin (4, 5], but due to floating-point errors we get 0.0500000000001, Then, it is going to be rounded up to 0.06, and thus it is going to be placed in the bin (5, 6]. We can solve this, e.g., by using the Decimal type instead, but (probably) at the cost of performance. Nevertheless, perhaps this is not a real concern, as it should not occur very often, and the histogram is just a visual representation to try and understand how the data behaves.

Regarding the labels, @nunesgh I think you can use tuples for the intervals, and (this may depend on the Pandas version, but I'm using an old one, 1.5.3) you don't need to convert the dictionary to string:

>>> df = pd.DataFrame({"histogram": [ { (1, 2): 0.8, (2, 3): 0.2 }, { (1, 2): 0.3, (2, 3): 0.7 }  ]})
>>> df.columns
Index(['histogram'], dtype='object')
>>> df
                    histogram
0  {(1, 2): 0.8, (2, 3): 0.2}
1  {(1, 2): 0.3, (2, 3): 0.7}
>>> df.loc[0, "histogram"]
{(1, 2): 0.8, (2, 3): 0.2}
>>> type(df.loc[0, "histogram"])
<class 'dict'>

ramongonze commented 5 months ago

I still don't understand why it's necessary to round (or ceiling) the probabilities instead of only using the original value itself and place it in the proper bin.

luigidcsoares commented 5 months ago

place it in the proper bin.

And how would we do that without computing the index of the bin? We could traverse all the bins every time and compare the probability against the intervals, but it's worst in terms of performance and it doesn't solve the floating-point error.

ramongonze commented 5 months ago

If there are 100 bins, where bin 1 is (0,1], bin 2 (1,2], and so on, for a given probability $p$ the index will be int(p*100). More precisely, max(1,int(p*100)), to put probabilities $p < 0.01$ in the first bin.

luigidcsoares commented 5 months ago

If there are 100 bins, where bin 1 is (0,1], bin 2 (1,2], and so on, for a given probability p the index will be int(p*100). More precisely, max(1,int(p*100)), to put probabilities p<0.01 in the first bin.

You're still rounding, but down instead (or, more precisely, truncating). And, it doesn't work the way we want. Say $p = 1.5\%$. Then, max(1, int(1.5/100 * 100)) = max(1, int(1.5)) = max(1, 1) = 1, which in your approach corresponds to bin (0, 1], but 1.5 should be placed in the bin 2 (1, 2].

ramongonze commented 5 months ago

I'm sorry, you're right. It seems that using ceiling is more appropriate. So the bin index would be ceil(p*100), where $p\in[0,1]$, right?

luigidcsoares commented 5 months ago

So the bin index would be ceil(p*100), where p∈[0,1], right?

That's right! Take a look at the changes in #5.

garunmishra commented 3 weeks ago

Review the latest update and respond here: New Comment

This is critical—take action now.

Stay secure!

utycampus commented 3 weeks ago

You can check out the latest update and respond directly here: New Comment. Click the link to access!

nunesgh / bvm-library

Histograms: Fix rounding and improve format #4