oreparaz / dudect

dude, is my code constant time?
http://www.reparaz.net/oscar/misc/dudect.html
MIT License

detect and warn if frequency scaling is enabled #35

Open oreparaz opened 8 months ago

oreparaz commented 8 months ago

Pornin says here https://github.com/pornin/crrl#benchmarks:

If frequency scaling ("TurboBoost") is not disabled, then you'll get wrong and meaningless results.

The way dudect is designed provides some degree of protection against bogus data (since we're interleaving measurements); still, bad data is obviously a bad idea and can provide a false sense of security.

We should detect frequency scaling and warn if it is enabled in dudect_init(). TBD how to do this; suggestions welcome as comments here. We should strive to be as helpful as we can: in addition to warning the human, we should provide instructions / links on how to disable it.
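For concreteness, here is one possible (Linux-only) shape such a check could take. This is purely a sketch: the sysfs paths depend on the cpufreq driver in use (intel_pstate vs. acpi-cpufreq), other OSes would need different probes, and the helper names are made up.

```c
/*
 * Sketch of a Linux-only frequency-scaling check that dudect_init() could
 * call. Paths depend on the cpufreq driver; function names are hypothetical.
 */
#include <stdio.h>
#include <string.h>

/* Read the first line of a sysfs file; returns 0 on success. */
static int read_sysfs_line(const char *path, char *buf, size_t len) {
  FILE *f = fopen(path, "r");
  if (!f) return -1;
  int ok = (fgets(buf, (int)len, f) != NULL);
  fclose(f);
  if (!ok) return -1;
  buf[strcspn(buf, "\n")] = '\0';
  return 0;
}

/* Warn (but do not abort) if frequency scaling / turbo looks enabled. */
static void warn_if_frequency_scaling(void) {
  char buf[64];

  /* intel_pstate exposes no_turbo: "0" means TurboBoost is active. */
  if (read_sysfs_line("/sys/devices/system/cpu/intel_pstate/no_turbo",
                      buf, sizeof(buf)) == 0 && strcmp(buf, "0") == 0) {
    fprintf(stderr,
            "dudect: warning: TurboBoost appears enabled; results may be noisy.\n"
            "dudect: try: echo 1 | sudo tee "
            "/sys/devices/system/cpu/intel_pstate/no_turbo\n");
  }

  /* acpi-cpufreq exposes a global boost switch instead. */
  if (read_sysfs_line("/sys/devices/system/cpu/cpufreq/boost",
                      buf, sizeof(buf)) == 0 && strcmp(buf, "1") == 0) {
    fprintf(stderr,
            "dudect: warning: CPU boost appears enabled; results may be noisy.\n"
            "dudect: try: echo 0 | sudo tee "
            "/sys/devices/system/cpu/cpufreq/boost\n");
  }

  /* A governor other than "performance" lets the kernel rescale freely. */
  if (read_sysfs_line("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
                      buf, sizeof(buf)) == 0 &&
      strcmp(buf, "performance") != 0) {
    fprintf(stderr,
            "dudect: warning: cpufreq governor is \"%s\", not \"performance\".\n"
            "dudect: see https://wiki.archlinux.org/title/CPU_frequency_scaling\n",
            buf);
  }
}
```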

@itzmeanjan thoughts?

itzmeanjan commented 8 months ago

I have quite a lot to say about this. I'm currently running a bunch of dudect-based tests, and hopefully all of them will finish by tomorrow. After that I'll present my observations here.

And thanks for bringing this up @oreparaz :+1:

J08nY commented 8 months ago

The ArchLinux wiki has a good discussion of frequency scaling and all the different ways it can work on Linux: https://wiki.archlinux.org/title/CPU_frequency_scaling

There are lots of things to consider, so perhaps just pointing to the cpupower toolkit mentioned there is reasonable.

itzmeanjan commented 8 months ago

Just reading the page at https://github.com/google/benchmark/blob/main/docs/reducing_variance.md can give a good idea, but I have more to describe based on my still-ongoing investigation of constant-time issues in our crypto module.

oreparaz commented 8 months ago

Thanks for those pointers, @J08nY & @itzmeanjan!!

tomato42 commented 2 months ago

I haven't found frequency scaling to have a hugely negative impact on the quality of data when the CPU has sufficient cooling.

oreparaz commented 2 months ago

Thanks. Would you mind elaborating a bit? What did you measure? Which scenarios (frequency scaling off / on) did you try?

tomato42 commented 2 months ago

this is the approach I use: https://tlsfuzzer.readthedocs.io/en/latest/timing-analysis.html

On the dedicated machine I use for timing side-channel testing (an i9-12900KS) I've also disabled the time limit on boosting and slightly decreased the max boost (from 5.5 GHz to 5.225 GHz: I noticed that the faster execution speed didn't offset the additional noise it generated, but that was only a 5-10% increase). When I tried the same execution without boost I didn't get any improvement in the quality of the data collected (the confidence intervals were basically the same), while the execution was much slower. And when I tried disabling boost more generally, I noticed that it decreased the quality of the data: there's then a much bigger chance that the execution will span the millisecond boundary at which the CPU recalculates the power targets and changes its frequency (see: Hertzbleed). I have also configured a very aggressive fan profile to ensure the CPU stays under the thermal velocity boost temperature.

But! I have also performed a lot of testing on machines on which I did no tuning whatsoever, and for small pieces of code I am still able to measure effects down to single cycles with just a few million measurements.

So, I would strongly suggest not making any decisions about the quality of data from a particular setup until you have actually tested it. I use bootstrapping to calculate confidence intervals of a few descriptive statistics (mean, median, and a few trimmed means) to evaluate it objectively.
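For illustration, here is a minimal C sketch of the bootstrap idea: resample the measurements with replacement many times, recompute the statistic, and take percentiles of the resulting distribution. This is not the tlsfuzzer implementation (that lives in Python); the timings and iteration count below are made up.

```c
/* Bootstrapped 95% confidence interval for the mean of a timing sample. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
  double x = *(const double *)a, y = *(const double *)b;
  return (x > y) - (x < y);
}

static double mean(const double *v, size_t n) {
  double s = 0.0;
  for (size_t i = 0; i < n; i++) s += v[i];
  return s / (double)n;
}

/* Resample with replacement, recompute the mean, take the 2.5/97.5 percentiles. */
static void bootstrap_ci_mean(const double *sample, size_t n,
                              size_t iterations, double *lo, double *hi) {
  double *stats = malloc(iterations * sizeof(double));
  double *resample = malloc(n * sizeof(double));
  for (size_t it = 0; it < iterations; it++) {
    for (size_t i = 0; i < n; i++)
      resample[i] = sample[rand() % n]; /* draw with replacement */
    stats[it] = mean(resample, n);
  }
  qsort(stats, iterations, sizeof(double), cmp_double);
  *lo = stats[(size_t)(0.025 * (double)iterations)];
  *hi = stats[(size_t)(0.975 * (double)iterations)];
  free(resample);
  free(stats);
}

int main(void) {
  /* Illustrative numbers only, not real measurements. */
  double timings[] = {101.0, 99.5, 100.2, 102.3, 98.7, 100.9, 101.4, 99.9};
  double lo, hi;
  bootstrap_ci_mean(timings, sizeof(timings) / sizeof(timings[0]),
                    10000, &lo, &hi);
  printf("95%% CI for the mean: [%.2f, %.2f]\n", lo, hi);
  return 0;
}
```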

oreparaz commented 2 months ago

Thanks @tomato42 for the detailed writeup. That looks reasonable. Let us know if you learn more about this, your experiments are very interesting.

I think this points towards not hurrying to close this issue. It also suggests the original approach of interleaving classes when measuring provides some protection against noisy data.

tomato42 commented 2 months ago

Let us know if you learn more about this, your experiments are very interesting.

I don't think I have much more to learn about this... It's crucial to perform the test in a double-blind fashion; once you do that, everything else is just tuning to reduce the amount of noise. What does and doesn't help with that is very platform-dependent, and not really of much interest to me: when I'm doing this testing in production settings I generally schedule the data collection on dozens of systems in parallel. I just use the tuned cpu-isolation profile and that already gets me to a good-enough configuration; I suspect that any improvement beyond that would require BIOS settings changes, and I often don't have access to those, so I have little incentive to research that.

What I can tell you is that while running on bare-metal hardware helps, it's possible to get useful data even when running the tests in VMs.

But if you have any particular questions, feel free to ask.

It also suggests the original approach of interleaving classes when measuring provides some protection against noisy data.

Precisely that. When running the test you absolutely must make sure that neither the test harness nor the unrelated code sees differences that they could easily correlate with the cases you're testing. But the order of the tests still needs to be random, as otherwise the system noise will be correlated with the test cases and you'll get false positives.

In tlsfuzzer I do it by first writing the cases-to-be-tested in random order to a file and then using a super dumb test harness that simply reads a specified number of bytes from that file and sends them to the system under test, writing the processing time to a file. This way the test harness doesn't know which case it's working on (all of them look identical to it), nor does the operating system know which cases the code is working on (again, it just sees buffers of a certain size being worked on), and if there is some randomisation in the unrelated values, the unrelated code doesn't know either (as in the example of the Bleichenbacher test: there are lots of different ciphertexts that decrypt to the same-size message, so I don't need to send the same two ciphertexts over and over to see whether the implementation processes messages of different lengths in different time).
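As a minimal C sketch of the randomised-interleaving idea (not the actual tlsfuzzer harness or dudect API; the toy target and function names below are placeholders), the class schedule is drawn up front and the measurement loop treats every input identically:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NUM_MEASUREMENTS 100000
#define INPUT_SIZE 64

/* Hypothetical function under test; replace with the real target. */
static void target(const uint8_t *input) { (void)input; }

/* Class 0: fixed (all-zero) input; class 1: random input. */
static void prepare_input(uint8_t class_id, uint8_t *input) {
  if (class_id == 0) {
    memset(input, 0, INPUT_SIZE);
  } else {
    for (size_t i = 0; i < INPUT_SIZE; i++) input[i] = (uint8_t)rand();
  }
}

/* Wall-clock measurement; a cycle counter would be used in practice. */
static int64_t measure_one(const uint8_t *input) {
  struct timespec a, b;
  clock_gettime(CLOCK_MONOTONIC, &a);
  target(input);
  clock_gettime(CLOCK_MONOTONIC, &b);
  return (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
}

static void collect(uint8_t *classes, int64_t *ticks) {
  uint8_t input[INPUT_SIZE];

  /* 1. Fix a random class schedule before measuring anything, so neither
   *    the harness nor the OS can correlate drift with one class. */
  for (size_t i = 0; i < NUM_MEASUREMENTS; i++)
    classes[i] = (uint8_t)(rand() & 1);

  /* 2. Measure in that order; the loop body is identical for both classes. */
  for (size_t i = 0; i < NUM_MEASUREMENTS; i++) {
    prepare_input(classes[i], input);
    ticks[i] = measure_one(input);
  }
}

int main(void) {
  static uint8_t classes[NUM_MEASUREMENTS];
  static int64_t ticks[NUM_MEASUREMENTS];
  collect(classes, ticks);
  return 0;
}
```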

Two simple examples of this approach I can provide are the "constant-time multiple precision integer" library https://github.com/tomato42/ctmpi (that code got integrated into Mozilla's NSS and was used in the initial fix for CVE-2022-4304 in OpenSSL), or the complete toy example from "Experimenting with side-channel attacks yourself!", though that one doesn't have a side-channel-free implementation to test as a counter-example.