smuellerDD / jitterentropy-library

Jitterentropy Library
http://www.chronox.de/jent

Overestimation of entropy #21

Closed kroeckx closed 3 years ago

kroeckx commented 3 years ago

I've been looking at the entropy collected by the library and how this is used in OpenWRT's urngd.

My understanding of the code is that jent_read_entropy() returns data in blocks of 256 bits (32 bytes). It internally calls jent_measure_jitter(). The code seems to assume that 1 call to jent_measure_jitter() collects 1 bit of entropy, so that with the oversampling set to 1, as the documentation suggests, you do 256 calls to jent_measure_jitter().

I've run some statistical tests, and it seems that I'm getting less than 0.01 bit of entropy per call of jent_measure_jitter() on a non-idle desktop system. I have no idea why you think you can get 1 bit of entropy out of such a function.

smuellerDD commented 3 years ago

Am Dienstag, dem 09.02.2021 um 13:26 -0800 schrieb Kurt Roeckx:

I've been looking at the entropy collected by the library and how this is used in OpenWRT's urngd.

My understanding of the code is that jent_read_entropy() returns data in blocks of 256 bits (32 bytes). It internally calls jent_measure_jitter(). The code seems to assume that 1 call to jent_measure_jitter() collects 1 bit of entropy, so that with the oversampling set to 1, as the documentation suggests, you do 256 calls to jent_measure_jitter().

Correct. See [1] section 3.3 for the mathematical description and 7.2.6 for the description of the entropy rate each step delivers.

I've run some statistical tests, and it seems that I'm getting less than 0.01 bit of entropy per call of jent_measure_jitter() on a non-idle desktop system. I have no idea why you think you can get 1 bit of entropy out of such a function.

Could you please help me understand what exactly you are measuring and how you reach your conclusion?

With the extensive documentation and test results provided in [1], supported by the test harness given in [2], I have not seen any such results as you describe on the hundreds of systems I tested or on any of the processor architectures I validated. Especially on loaded systems, the entropy is commonly higher than on idle systems.

Especially [1] section 5 together with section 7 provides the analysis. Section 6 provides an extensive analysis of the root cause to the extent that I deployed the noise source on a bare metal system without any OS.

Thanks a lot Stephan

[1] https://chronox.de/jent/doc/CPU-Jitter-NPTRNG.pdf

[2] https://chronox.de/jent/jitterentropy-testing-20201205.tar.xz

kroeckx commented 3 years ago

I used your software to output the raw data, using the tool you've linked to. I converted that file and then ran NIST's SP800-90B entropy non-IID assessment tool on it, which returns things like:

H_original: 0.004746 H_bitstring: 0.000698

min(H_original, 8 X H_bitstring): 0.004746

smuellerDD commented 3 years ago

Am Dienstag, dem 09.02.2021 um 23:38 -0800 schrieb Kurt Roeckx:

I used your software to output the raw data, using the tool you've linked to. I converted that file and then ran NIST's SP800-90B entropy non-IID assessment tool on it, which returns things like:

H_original: 0.004746 H_bitstring: 0.000698

min(H_original, 8 X H_bitstring): 0.004746

That is very interesting and indicates that you have a very coarse timer. The Jitter RNG requires a high-res timer. This ought to be checked with the initialization call of jent_entropy_init(). What is the return code of that call? If that return code is 0 (i.e. success), the question now arises whether the Jitter RNG uses its internal high-res timer generator (if you use 3.0.x). In this case, you would need to check back whether your entropy measurement uses the internal timer generator.
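
For reference, a minimal usage sketch of the public API showing where that check sits (error handling abbreviated; osr = 1 and flags = 0 are just example values, not a recommendation):

#include <stdio.h>
#include <jitterentropy.h>

/* Minimal usage sketch (not the urngd code): check the timer first, then
 * allocate a collector and read one 256-bit block. */
int main(void)
{
        struct rand_data *ec;
        char block[32];
        int ret = jent_entropy_init();

        if (ret) {
                fprintf(stderr, "jent_entropy_init() failed with %d\n", ret);
                return 1;       /* timer unsuitable - do not use the Jitter RNG */
        }

        ec = jent_entropy_collector_alloc(1, 0);
        if (!ec)
                return 1;

        if (jent_read_entropy(ec, block, sizeof(block)) < 0)
                fprintf(stderr, "jent_read_entropy() failed\n");

        jent_entropy_collector_free(ec);
        return 0;
}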

Ciao Stephan

smuellerDD commented 3 years ago

I also checked the test and it seems that I did not fully document how to test the case when the Jitter RNG uses the internal timer: If this is the case for you, invoke the jitterentropy-hashtime as follows: jitterentropy-hashtime <rounds per repeat> <number of repeats> <filename> 1

Note, the 5th argument (its exact value does not matter) enables testing of the internal timer generator instead of using the external time source. Could this be the solution to your issue?

kroeckx commented 3 years ago

I do not believe I have a coarse timer. It was using the rdtsc instruction. From previous tests, I know it works properly. I also know that from run to run it behaves differently if not using a serialization instruction (like cpuid).

I've now changed the FORCE_NOTIME_NOISE_SOURCE to 1 in invoke_testing.sh. If I understand things correctly, that uses a different thread to do nothing else but increase a value, and then have the other thread read that value, adding some scheduling jitter between the threads.

That changes the entropy estimate to 0.11.

Your test program does not seem to call jent_entropy_init.

smuellerDD commented 3 years ago

Am Mittwoch, dem 10.02.2021 um 10:29 -0800 schrieb Kurt Roeckx:

I do not believe I have a coarse timer. It was using the rdtsc instruction.

And you are not running on top of a VMM that hijacks rdtsc?!

From previous tests, I know it works properly. I also know that from run to run it behaves differently if not using a serialization instruction (like cpuid).

Can you please send me your output file in a private mail?

Which CPU do you have (i.e. which content does /proc/cpuinfo have)?

I've now changed the FORCE_NOTIME_NOISE_SOURCE to 1 in invoke_testing.sh. If I understand things correctly, that uses a different thread to do nothing else but increase a value, and then have the other thread read that value, adding some scheduling jitter between the threads.

Correct.

That changes the entropy estimate to 0.11.

This is very interesting. Again, can you please send me the data file?

Your test program does not seem to call jent_entropy_init.

The test program does not call this function, right. This call should only be used when the RNG is used in normal operation.

Thanks a lot Stephan

kroeckx commented 3 years ago

Can you please send me your output file in a private mail?

rdtsc.data.gz

The test gives this as output: H_original: 0.004873 H_bitstring: 0.000886

min(H_original, 8 X H_bitstring): 0.004873

Which CPU do you have (i.e. which content does /proc/cpuinfo have)?

Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz

This is very interesting. Again, can you please send me the data file?

It seems it was just pure luck that it returned 0.11 two times before. I assume it was caused by something else going on in the background. I can't reproduce such high numbers currently; the highest I saw now was 0.011 once, all the others were 0.003 - 0.005.

notime.data.gz gives: H_original: 0.003089 H_bitstring: 0.000502

min(H_original, 8 X H_bitstring): 0.003089

It's generated by using your 2nd column and using the last 8 bits of that.

smuellerDD commented 3 years ago

Am Mittwoch, dem 10.02.2021 um 12:01 -0800 schrieb Kurt Roeckx:

Can you please send me your output file in a private mail?

Sorry for the delay - too many things going on.

rdtsc.data.gz

This is not the file I was looking for. How did you generate that file?

Note, I am looking for a file as generated by my tool set that looks like:

6304 2783
3869 2806
5985 2812
6241 3002
5810 2800
6885 2776
3824 2779
5280 2861
8674 2782
6083 2778
5110 2781
2879 2773
4005 2777
9699 2798
...

The two columns are the raw time stamp for the upper and lower boundary.

Now, the key is that for each column, the 8 LSB of the time stamps are extracted and concatenated to a bit stream. This bit stream now has to be given to the SP800-90B tool.
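
As an illustration, a minimal stand-in for that extraction step could look like the following (file names and the choice of the second column are just an example, not the released validation tool set):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Read the two-column ASCII time stamps, keep the 8 LSB of the second
 * column, and write them as a byte stream for the SP800-90B tool. */
int main(int argc, char *argv[])
{
        FILE *in, *out;
        uint64_t upper, lower;

        if (argc != 3)
                return 1;
        in = fopen(argv[1], "r");
        out = fopen(argv[2], "wb");
        if (!in || !out)
                return 1;

        while (fscanf(in, "%" SCNu64 " %" SCNu64, &upper, &lower) == 2) {
                uint8_t lsb = (uint8_t)(lower & 0xff);  /* 8 LSB of the 2nd column */
                fwrite(&lsb, 1, 1, out);
        }

        fclose(in);
        fclose(out);
        return 0;
}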

I just saw today that I did not release the validation tool set along with the recording tool set. I will add that shortly. But in the mean time, you can prepare the file with the ASCII time stamps.

Thanks Stephan

kroeckx commented 3 years ago

As I explained in my previous message, I took the last 8 bits of the number in the 2nd column of your file. That is, I wrote my own tool to turn it into a file that can be used by the SP800-90B tool.

Looking at the code, I do not believe the first column contains more entropy than the second; you just seem to make it harder for the tool to properly estimate it by executing the same code a random number of times.

smuellerDD commented 3 years ago

Am Dienstag, dem 16.02.2021 um 10:03 -0800 schrieb Kurt Roeckx:

As I explained in my previous message, I took the last 8 bits of the number in the 2nd column of your file. That is, I wrote my own tool to turn it into a file that can be used by the SP800-90B tool.

I would like to see the raw output data if possible.

Looking at the code, I do not believe the first column contains more entropy than the second; you just seem to make it harder for the tool to properly estimate it by executing the same code a random number of times.

That may be your interpretation. Other people have different interpretations. To allow everybody to get the full picture, I provide analysis tools for both view points.

In any case, the jent_entropy_init() should detect that your system is not suitable for the Jitter RNG. This result of the API call should imply that the Jitter RNG is not started and thus not running on your system.

Ciao Stephan

kroeckx commented 3 years ago

I should probably clarify my statement. I believe that the random amount of execution does increase the jitter and entropy, but the tools will overestimate the entropy because you're also measuring the output of a random number generator. You should not run an entropy assessment on the output of an RNG.

Your measurement method itself also introduces jitter that's not there when it acts as an entropy daemon.

It's not obvious to me how to estimate the real amount of entropy you would normally get, because there are a lot of factors that have an effect, and separating them will probably underestimate it.

smuellerDD commented 3 years ago

Am Mittwoch, dem 17.02.2021 um 11:36 -0800 schrieb Kurt Roeckx:

I think we are in full agreement that the analysis of any data that is post- processed is meaningless for entropy discussions.

Thus allow me to summarize what I am doing.

The concept of the Jitter RNG is as follows:

a = timestamp();
new_entropy_pool = sha3(old_entropy_pool, b(prev) - a);
b = timestamp();

The entropy is in the delta of b - a (which is the value that will eventually be post-processed to derive a random number).

Thus, to analyze the entropy behavior and rate, I simply collect the delta (b - a).

As the delta is a 64-bit integer, and knowing that the SP800-90B tool set cannot process symbols of that width, the 8 LSB of each integer are cut out and concatenated to form a bit stream (the 90B tool only handles such bit streams).

Now, the bit stream is fed into the 90B tool and we get an entropy value. Commonly this is 3 to 6 bits (or very low if the timer seems to be too coarse which seems to be the case for your system). This and only this entropy rate is what is of interest.

Yes, my tool set also calculates an entropy rate for the post-conditioned data (i.e. the data coming out of the SHA-3). We all know that this is totally meaningless for an entropy discussion. Yet, SP800-90B wants this number (for whatever reason). So my tool provides it, but none of my rationale rests on that at all.

To your last paragraph: The only operation the Jitter RNG performs with the time delta we discussed above is to inject it into SHA-3. As SHA-3 is considered to be entropy preserving, it should be safe to say that we simply need to sum up the entropy values we calculated for the collected time deltas.

Yet, it would still be great if I can have a peek at your time delta values (b - a).

Ciao Stephan

kroeckx commented 3 years ago

The interface you publish for applications to use and the test tool do something different, and they really should be doing the same thing if you want to understand how much entropy it collects.

Some of the differences:

  • The real application uses the jent_random_data() function that does 256 * osr calls of jent_measure_jitter()

  • Your test application doesn't call jent_measure_jitter(), it calls jent_hash_var_stat() twice, which in turn calls some of the functions jent_measure_jitter() calls.

  • Your test application calls fprintf() after the jent_hash_var_stat() calls, instead of calling it 256 * osr times, which is an additional source of jitter.

smuellerDD commented 3 years ago

Am Samstag, 20. Februar 2021, 12:48:27 CET schrieb Kurt Roeckx:

Hi Kurt,

The interface you publish for applications to use and the test tool do something different, and they really should be doing the same thing if you want to understand how much entropy it calls.

Some of the differences:

  • The real application uses the jent_random_data() function that does 256 * osr calls of jent_measure_jitter()

  • Your test application doesn't call jent_measure_jitter(), it calls jent_hash_var_stat() twice, which in turn calls some of the functions jent_measure_jitter() calls.

  • Your test application calls fprintf() after the jent_hash_var_stat() calls, instead of calling it 256 * osr times, which is an additional source of jitter.

I would like you to check the code as follows:

The test code of jent_hash_var_stat performs:

  1. time stamp a

  2. jent_memaccess

  3. jent_stuck

  4. jent_hash_time(a)

  5. time stamp b

  6. return b - a which then is printed out for further assessment

With this code you see that only the execution time of function calls 2 through 5 is measured, nothing else. Any operation outside of this function call is not measured nor is it affecting the entropy as far as I can see.
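
In schematic C, the measured region therefore has the following shape. timestamp() and do_work() are placeholders standing in for jent_get_nstime() and the jent_memaccess()/jent_stuck()/jent_hash_time() sequence; they are not the library functions, just a sketch of the structure:

#include <stdint.h>
#include <time.h>

static uint64_t timestamp(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}

static volatile uint64_t sink;

static void do_work(uint64_t input)
{
        sink ^= input * UINT64_C(0x9e3779b97f4a7c15);   /* dummy work */
}

static uint64_t measure_once(void)
{
        uint64_t a = timestamp();       /* step 1 */

        do_work(a);                     /* steps 2 through 4 */
        return timestamp() - a;         /* steps 5 and 6: the recorded delta */
}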

The RNG code in jent_measure_jitter does:

  1. jent_memaccess

  2. time stamp b

  3. get b - a where a is the time stamp from the previous round

  4. jent_stuck

  5. jent_hash_time(b - a)

This code is repeated 256 times.

You see that the code in jent_measure_jitter measures the execution time of the exact same steps as the test code. The only difference is that instead of taking 2 time stamps at the beginning and the end, it takes one time stamp and remembers it for the next loop iteration. This implies that additional steps during that loop, such as when one 256-round loop finishes and the next one starts, contribute to entropy in jent_measure_jitter. Yet, this additional logic is not measured in the test code -> the test rather underestimates the entropy.

The only real difference between the test code and the RNG code is that jent_hash_time processes a full time stamp in the test code and a time delta in the RNG code. Yet, as both values are 64 bit integers and jent_hash_time invokes a SHA-3 operation, it is irrelevant to the execution time which value this 64 bit integer has.
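
Using the same timestamp()/do_work() placeholders as in the sketch above, the RNG-side loop carries the time stamp over to the next iteration roughly like this (the stuck test of step 4 is omitted for brevity; again a sketch, not the library code):

static void measure_loop(uint64_t *deltas, unsigned int n)
{
        uint64_t prev = timestamp();
        unsigned int i;

        for (i = 0; i < n; i++) {
                uint64_t now, delta;

                do_work(prev);          /* step 1: memory access stand-in */
                now = timestamp();      /* step 2 */
                delta = now - prev;     /* step 3: delta against the previous round */
                do_work(delta);         /* step 5: hash stand-in */
                deltas[i] = delta;
                prev = now;
        }
}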

To your issues:

Thank you for your review.

Ciao Stephan

kroeckx commented 3 years ago

The code that is executed outside your measurement (the fprintf) has an effect on the measurement. The fprintf() call internally caches things and after a while does a system call to write it to a file. It writes this in blocks of 4096 bytes for me. An application using the library will most likely do a system call every 256 * osr calls instead.

This can have the effect that you now get cache misses, no longer have good branch prediction, and so on. These are additional sources of jitter, and you're trying to measure the jitter.

In my testing, changes outside the measurement make the tool claim at least 4.5 bits more entropy.

smuellerDD commented 3 years ago

Am Samstag, 20. Februar 2021, 15:31:49 CET schrieb Kurt Roeckx:

Hi Kurt,

The code that is executed outside your measurement (the fprintf) has an effect on the measurement. The fprintf() call internally caches things and after a while does a system call to write it to a file. It writes this in blocks of 4096 bytes for me. An application using the library will most likely do a system call every 256 * osr calls instead.

This can have the effect that you now get cache misses, no longer have good branch prediction, and so on. These are additional sources of jitter, and you're trying to measure the jitter.

In my testing, changes outside the measurement make the tool claim at least 4.5 bits more entropy.

Agreed that cache effects are possible. I have changed the code now to allocate a sufficient amount of memory to hold the time deltas and to print them out after collecting all requested time deltas. As the following code does not collect the data in 256-sized blocks but in one go, it should definitely not exhibit any overestimation.

When running the code on my local system (Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz), I get the following values on a totally quiet system (i.e. OS scheduling should not be visible too much here and increase the measured time delta variances):

The values do not seem to really deviate from what I have recorded previously and documented in the big PDF for this system.

The diff is:

diff --git a/recording_userspace/jitterentropy-hashtime.c b/recording_userspace/jitterentropy-hashtime.c
index 9e44f66..629f24d 100644
--- a/recording_userspace/jitterentropy-hashtime.c
+++ b/recording_userspace/jitterentropy-hashtime.c
@@ -55,8 +55,19 @@ static int jent_one_test(const char *pathname, unsigned long rounds,
        unsigned long size = 0;
        struct rand_data *ec = NULL;
        FILE *out = NULL;
+       uint64_t *duration, *duration_min;
        int ret = 0;

+       duration = calloc(rounds, sizeof(uint64_t));
+       if (!duration)
+               return 1;
+
+       duration_min = calloc(rounds, sizeof(uint64_t));
+       if (!duration_min) {
+               free(duration);
+               return 1;
+       }
+
        printf("Processing %s\n", pathname);

        out = fopen(pathname, "w");
@@ -79,16 +90,17 @@ static int jent_one_test(const char *pathname, unsigned long rounds,
        ec->fips_enabled = 1;

        for (size = 0; size < rounds; size++) {
-               uint64_t duration = 0;
-               uint64_t duration_min = 0;
-
-               duration = jent_hash_var_stat(ec, 0);
+               duration[size] = jent_hash_var_stat(ec, 0);
                //fwrite(&duration, 1, 8, out);
-               duration_min = jent_hash_var_stat(ec, 1);
-               fprintf(out, "%lu %lu\n", duration, duration_min);
+               duration_min[size] = jent_hash_var_stat(ec, 1);
        }

+       for (size = 0; size < rounds; size++)
+               fprintf(out, "%lu %lu\n", duration[size], duration_min[size]);
+
 out:
+       free(duration);
+       free(duration_min);
        if (notime)
                jent_notime_unsettick(ec);
        if (out)

Ciao Stephan

joshuaehill commented 3 years ago

For many reasons (described in issue #28), I think that the first column produced by jitterentropy-hashtime isn't very useful for an assessment. As such, I've been looking into results using the second column of jitterentropy-hashtime using the library and test code currently on GitHub, in particular I've been performing some testing inspired by Section 6 of the "CPU Time Jitter Based Non-Physical True Random Number Generator" paper.

All testing was conducted on a fairly quiescent modern bare-metal Intel-based (Cascade lake) system. In order to extract the assessment, I performed 101 million trials (1 million at a time). I discarded a small number of the larger delta values to deal with atypical events (less than the top 0.28% of the data was discarded), which left me with sample sets containing more than 100 million samples, taken from an alphabet of fewer than 256 distinct delta values. I translated the remaining values to single byte values (using an order-preserving translation), and ran the full 90B assessment on the resulting data. This approach should assess higher than the procedure you outline, as it does not truncate the data to the lowest order nibble.
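
To illustrate the translation step, the following sketch maps at most 256 distinct delta values to byte symbols by rank. This is my reading of an order-preserving translation, not the exact tool used:

#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

        return (x > y) - (x < y);
}

/* deltas: input samples (outliers already removed); out: one byte per sample */
static int translate(const uint64_t *deltas, size_t n, uint8_t *out)
{
        uint64_t alphabet[256];
        size_t distinct = 0, i, j;

        for (i = 0; i < n; i++) {
                for (j = 0; j < distinct; j++)
                        if (alphabet[j] == deltas[i])
                                break;
                if (j == distinct) {
                        if (distinct == 256)
                                return -1;      /* more than 256 distinct symbols */
                        alphabet[distinct++] = deltas[i];
                }
        }
        qsort(alphabet, distinct, sizeof(alphabet[0]), cmp_u64);

        for (i = 0; i < n; i++) {
                const uint64_t *p = bsearch(&deltas[i], alphabet, distinct,
                                            sizeof(alphabet[0]), cmp_u64);
                out[i] = (uint8_t)(p - alphabet);       /* rank preserves order */
        }
        return 0;
}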

I use the default Makefile.hashtime, which compiles everything using -O2. I later noticed that disabling optimization has a substantial impact (testing time increases by a factor of about an order of magnitude), but I haven't yet looked into why that is. It also looks like the default library Makefile also uses -O2, so this behavior is likely the relevant behavior in any case.

  • For the base behavior (rdtsc, no attempt at instruction serialization, no core pinning), I got an assessment of about 0.36 bits per raw symbol.

  • When I pinned the jitterentropy-hashtime process to a particular core, I got an assessment of about 0.41 bits per raw symbol.

  • When I added a CPUID instruction prior to the rdtsc instruction to force in-order execution, then I get an assessment of about 0.23 bits of min entropy per raw symbol.

  • If I use a rdtscp instruction (with no cpuid instruction) instead of the rdtsc instruction, I get about 1.68 bits of min entropy per raw symbol. I have no idea why this is so different; I was expecting a result similar to the cpuid/rdtsc combination.

My main takeaway here is that the underlying assumption of a worst-case 1 bit per jent_measure_jitter call probably isn't conservative, and that an oversample rate of >= 5 seems like a good idea on my particular system; it seems likely that this is also true for other modern Intel-based systems.

joshuaehill commented 3 years ago

Well, I found out why -O2 reduced the runtime so much; it basically optimizes out the useful parts of the jent_memaccess call. I'll open an issue for the current Makefile and reconduct the testing.

kroeckx commented 3 years ago

This is a version of the test program I was using: jitterentropy-hashtime.c.gz (Based on jitterentropy-testing-20201205).

The attached version doesn't give reproducible results for me on my non-idle desktop, but I get something in the range of 0.15 - 2.4, depending on the run.

If I change things like the SIZE, I get different results. For instance, when I change it from 4096 to 256, it returns values on the order of 2.7 - 4.9. If I then uncomment the lines that are commented out, it changes to be between 4.5 and 5.7. So my best estimate is that it changed from 0.15 bits of entropy to 4.5 bits of entropy, just by doing something external to the measurement, and that 0.15 is most likely an overestimate.

joshuaehill commented 3 years ago

I've been testing using the currently checked in version in the master branch for both the library and the testing (because it includes a bunch of SP 800-90B compliance changes that I care about). I have also taken it on myself to change jitterentropy-hashtime.c to:

  1. Disable the (pseudo-)random round data production (I'm not using the data, and the testing would take much longer with that enabled).
  2. Set ec_min->fips_enabled to 1 (as pointed out in issue #32)
  3. Changed the "initialization" jent_measure_jitter prior to the fixed-round outputs to use a fixed round
  4. Output a single file across all rounds in a binary format (uint32_t values) that I use in other tools

I don't think that any of these changes should affect the results, I made them to make large scale testing a bit easier.

I ran tests on a few variations of the __x86_64__ jent_get_nstime implementation. The version checked in uses inline assembly to perform a rdtsc instruction. I made a few variants of this approach:

  1. A version where I use a __rdtsc() compiler intrinsic instead of the inline assembly. (This version is intended to have the same characteristics as the checked in version). This is labeled as the rdtsc variant.
  2. A version where I use a compiler intrinsic to add a cpuid instruction prior to the rdtsc instruction. This is labeled as the cpuid variant.
  3. A version where I use a rdtscp compiler intrinsic instead of the rdtsc compiler intrinsic. This is labeled as the rdtscp variant.
  4. A version where I apply all the fences noted in the Intel architecture guide for the rdtscp instruction to force instruction ordering / memory commitment before and after the instruction: the order discussed is mfence then rdtscp then lfence. This is labeled as the fenced variant (sketches of all four variants follow this list).
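
For concreteness, the four variants correspond roughly to the following timestamp readers. The function names are mine and this is only a sketch; the library's checked-in version uses inline assembly for the plain rdtsc case:

#include <stdint.h>
#include <x86intrin.h>          /* __rdtsc, __rdtscp, _mm_mfence, _mm_lfence */
#include <cpuid.h>              /* __get_cpuid */

static uint64_t ts_rdtsc(void)
{
        return __rdtsc();
}

static uint64_t ts_cpuid_rdtsc(void)
{
        unsigned int a, b, c, d;

        __get_cpuid(0, &a, &b, &c, &d); /* serialize the instruction stream */
        return __rdtsc();
}

static uint64_t ts_rdtscp(void)
{
        unsigned int aux;

        return __rdtscp(&aux);
}

static uint64_t ts_fenced(void)
{
        unsigned int aux;
        uint64_t t;

        _mm_mfence();                   /* commit prior loads/stores */
        t = __rdtscp(&aux);
        _mm_lfence();                   /* keep later work from starting early */
        return t;
}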

Each round, I created 1 million delta values. I ran 1000 rounds of each of these variants pinned to a single core, and an additional set for the rdtsc variant not pinned to a specific core.

I used clang-11.1.0 to compile all the binaries. As the optimizer seems to have a major impact on the performance of this system, I tested using -O0, -O1, -O2, and -O3.

I conducted all testing on a fairly quiescent modern bare-metal Intel-based (Cascade lake) system.

Here are histograms for the delta values and the assessments for these various situations. If you want to see this data in some other format, let me know.

[Histograms of the delta values and the corresponding entropy assessments were attached for each optimization level (-O0, -O1, -O2, -O3), covering the not-pinned, rdtsc, cpuid, rdtscp, and fenced series.]
joshuaehill commented 3 years ago

I'm not sure that it's completely clear what conclusions to draw, honestly. :-) Here are a few:

  1. Clearly the optimizations used make a big difference, which is already flagged in Stephan's extensive documentation. That being said, it seems reasonable to change the default Makefiles to make "no optimization" default (using -O0 instead of the current flag -O2).
  2. The delta distribution doesn't seem to have an obvious correlation with the assessed entropy. In instances where really low entropy assessments are produced, this seems mainly (perhaps solely?) due to a low entropy assessment in the t-tuple estimator, which is a test that can perform poorly with nearly any given delta histogram. The LRS, lag, MultiMMC, and compression estimators also seem to provide notably low assessments in these cases. Together, these imply that recurring multi-symbol patterns can become a problem in this source. It may be a good idea to implement a health test that watches for such low-entropy conditions. The existing APT / RCT are largely insensitive to this type of defect. Most of the listed 90B estimators are quite complicated, but the lag test isn't, so it is a good candidate for converting to a health test (a rough sketch follows this list).
  3. I was expecting that pinning the testing process to a single core would decrease the assessed entropy, but it appears to have only a modest effect with the tested hardware (and the direction of the effect wasn't consistent!). The tested architecture synchronizes the TSC between all cores on the same processor (and there is only one processor on this particular machine), so it may be that pinning isn't really all that important in that context. I'll re-run the other test variants without pinning and try to see what happens.
  4. I wasn't (and still am not) really sure what the appropriate approach to extracting the TSC is. The "fenced" variant of rdtscp seems the best way to accurately measure the variation associated with the memory task / SHA-3 conditioning, but in at least some circumstances, it seems that the variation due to when the rdtsc is actually run is a substantial component of the total variation. In addition, the necessary fences are only available with SSE2, and rdtscp is only available in Nehalem and later. The cpuid approach would work, but is more disruptive than the "fenced rdtscp" approach. In addition, to get all the benefits of the "fenced rdtscp" approach, you would need to run the cpuid before and after the rdtsc instruction, I think.
  5. This may just be my ignorance and persistent failure to really learn GCC's inline assembly syntax and conventions, but I think that using compiler intrinsics is a bit easier to read than the existing inline assembly. This may impose some compiler restrictions (but then, using GCC's inline assembly syntax also does so).
  6. I've done quite a bit of testing of prior versions of this project on other architectures, but not with this new code base yet. My suspicion is that the underlying assessment distribution is going to vary significantly with the underlying hardware details, so I think that it is going to be hard to support a broad hardware-independent entropy lower bound claim. (This certainly seemed to be the case with the prior versions that I have tested across several architectures.)
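
To make the idea in point 2 concrete, a lag-based health test could be shaped roughly like this. It counts, per small lag, how often the current delta repeats a recently seen delta and flags the source if any lag predicts too well within a window; LAG_MAX, LAG_WINDOW, and LAG_CUTOFF are illustrative placeholders, not derived parameters:

#include <stdint.h>
#include <string.h>

#define LAG_MAX         8
#define LAG_WINDOW      512
#define LAG_CUTOFF      400

struct lag_health {
        uint64_t history[LAG_MAX];      /* most recent deltas, newest first */
        unsigned int hits[LAG_MAX];     /* per-lag matches in the current window */
        unsigned int filled;            /* valid entries in history */
        unsigned int window;            /* samples counted in the current window */
};

/* feed one delta; returns 1 if the health test trips, 0 otherwise */
static int lag_health_insert(struct lag_health *h, uint64_t delta)
{
        unsigned int d;
        int failure = 0;

        /* history[d] is the delta seen d+1 samples ago */
        for (d = 0; d < h->filled; d++)
                if (delta == h->history[d])
                        h->hits[d]++;

        memmove(&h->history[1], &h->history[0],
                (LAG_MAX - 1) * sizeof(h->history[0]));
        h->history[0] = delta;
        if (h->filled < LAG_MAX)
                h->filled++;

        if (++h->window == LAG_WINDOW) {
                for (d = 0; d < LAG_MAX; d++)
                        if (h->hits[d] >= LAG_CUTOFF)
                                failure = 1;
                memset(h->hits, 0, sizeof(h->hits));
                h->window = 0;
        }
        return failure;
}
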
joshuaehill commented 3 years ago

@kroeckx After reviewing your code, I noted that you used a much smaller buffer than I did. Your code works with a buffer of 4096 uint64_t, whereas the way that I was using @smuellerDD's new test code (in tests/raw_entropy/recording_userspace/jitterentropy-base.c) would generally use a much larger buffer (at least, it did with my test parameters). Do you anticipate that using a smaller buffer would lead to more consistent results?

In one of your messages, you mentioned use of a "serialization instruction like cpuid"; did you integrate a cpuid call into the testing somewhere? I'm also curious if you have any comments regarding the use of rdtsc vs some sort of cpuid/rdtsc construction vs rdtscp vs mfence-rdtscp-lfence.

kroeckx commented 3 years ago

I've tried various sizes, from 256 bytes to 1 Mbyte. I did not do enough tests with all of them to know if there is a pattern, but I have the feeling that both 256 bytes and 1 Mbyte give a higher estimate, while sizes in the range of 4 kbyte to 64 kbyte give a lower value.

Also note that it's stored in a uint8_t, not a uint64_t, so I just store the last 8 bits. I've tried tests with uint64_t too, which I think gave a higher estimate because the buffer was probably 512 kbyte in that case.

I did not try adding cpuid in any of my tests. I did use it in another test that's not related to the jitter entropy. It was used to check the execution time of code that was written to be constant-time. The results in that test without cpuid were all very strange: they seemed to depend on the data, while the point of the code is that it shouldn't have any effect, they weren't really reproducible from run to run, and so on. Adding cpuid gave much more useful results for that test. Since we want jitter, it's probably more useful to not have the cpuid instruction in the real code, but it might be useful to test with it, because there are also CPUs that don't need such instructions to get reproducible results.

joshuaehill commented 3 years ago

Thanks @kroeckx, that was very helpful!

Your comment helped me firm up some of my fuzzy thinking with respect to where to introduce a synchronizing instruction. In testing, we do care about instruction ordering, but only between production of whatever the "natural sample size" is (that is, however many samples are expected to occur in a row without doing much else). In this library, I think that the natural sample size is (DATA_SIZE_BITS*ec->osr) 64-bit delta samples, as blocks of that many jent_measure_jitter calls are grouped together for one output. As such, I produced a version of jitterentropy-hashtime that buffers outputs into arrays of this size, and introduced a cpuid instruction just before filling each buffer of (DATA_SIZE_BITS*ec->osr) samples. This is both consistent with how the jent_measure_jitter function is used in the deployed library, and it allows me to vary the buffer size by altering the osr value.

I'm presently producing data sets for various selections of osr (and thus different buffer sizes); I'll post the results when the analysis is done.

joshuaehill commented 3 years ago

I ran a few different configurations (mostly) using the approach that I outlined above. For the non-optimized code (-O0), the resulting assessment was in the interval [2.63, 5.75] for all buffer sizes (the median assessments were all above 3.8).

I then ran through optimized versions of the same process (using '-O2') and ran two runs, one where I allowed the cpuid instruction between "natural samples" to be optimized out, and one where I forced it to be in place.

For the version where I forced the serializing instruction cpuid to run, I found that the lowest observed assessed median entropy was 0.05 bits of min entropy with a buffer of 2^15 values. I stored 16-bit values in this test, so that corresponds to a 2^16-byte buffer. The observed assessments were all in the interval [0.03, 2.42].

For the version where the cpuid was optimized out, the lowest observed assessed median entropy was 0.08, with the observed assessments being in the interval [0.04, 2.55].

The versions with the cpuid included between "natural samples" seemed to be much more consistent, and those median assessments seemed to mostly decrease as the buffer size increased.

smuellerDD commented 3 years ago

Am Donnerstag, dem 18.03.2021 um 22:54 -0700 schrieb Joshua E. Hill:

Hi Joshua,

(sorry for the delayed message - but this week was, well, a challenge :-) )

Thanks to you and Kurt for sharing your work and assessments.

For many reasons (described in issue #28), I think that the first column produced by jitterentropy-hashtime isn't very useful for an assessment. As such, I've been looking into results using the second column of jitterentropy-hashtime using the library and test code currently on GitHub, in particular I've been performing some testing inspired by Section 6 of the "CPU Time Jitter Based Non-Physical True Random Number Generator" paper.

All testing was conducted on a fairly quiescent modern bare-metal Intel-based (Cascade lake) system. In order to extract the assessment, I performed 101 million trials (1 million at a time). I discarded a small number of the larger delta values to deal with atypical events (less than the top 0.28% of the data was discarded), which left me with sample sets containing more than 100 million samples, taken from an alphabet of fewer than 256 distinct delta values. I translated the remaining values to single byte values (using an order-preserving translation), and ran the full 90B assessment on the resulting data. This approach should assess higher than the procedure you outline, as it does not truncate the data to the lowest order nibble.

May I ask how exactly you created the bit-string for the 90B tool? If you concatenate the 64 bit time values, how did you convince the 90B tool that you have 64 bit symbols knowing that the Markov test only uses 6 or 8 bits as maximum width?

To me, that was always the core problem, and thus I applied the truncation, as otherwise the most-significant bits, which hardly move, are treated as individual symbols, which naturally lowers the entropy rate calculated by 90B.

I use the default Makefile.hashtime, which compiles everything using -O2. I later noticed that disabling optimization has a substantial impact (testing time increases by a factor of about an order of magnitude), but I haven't yet looked into why that is. It also looks like the default library Makefile also uses -O2, so this behavior is likely the relevant behavior in any case.

Yes, -O2 should be considered.

  • For the base behavior (rdtsc, no attempt at instruction serialization, no core pinning), I got an assessment of about 0.36 bits per raw symbol.

With the statement above, could you please send me the precise steps of how the data is converted for the 90B tool and how you invoke the 90B tool?

  • When I pinned the jitterentropy-hashtime process to a particular core, I got an assessment of about 0.41 bits per raw symbol.

Pinning should not deliver any additional value.

  • When I added a CPUID instruction prior to the rdtsc instruction to force in-order execution, then I get an assessment of about 0.23 bits of min entropy per raw symbol.
  • If I use a rdtscp instruction (with no cpuid instruction) instead of the rdtsc instruction, I get about 1.68 bits of min entropy per raw symbol. I have no idea why this is so different; I was expecting a result similar to the cpuid/rdtsc combination.

My main take away here is that the underlying assumption of a worst-case 1 bit per jent_measure_jitter call probably isn't conservative, and that an oversample rate of >= 5 seems like a good idea on my particular system; it seems likely that this is also true for other modern Intel-based systems.

Note, based on your and Kurt's statements, I have started to get rid of the changing number of SHA-3 invocations. Basically, SHA-3 is invoked once and done - which is exactly what the 2nd column of data, on which you and Kurt based the assessment, gives.

To counter that, the oversampling rate is increased to at least 5 (the caller can request higher oversampling rates, but the minimum would be 5). That means that to get 256 bits of output data, the RNG would now use 1280 time delta values. The minimum implied entropy per time delta is 256 / 1280 = 0.2. This approach should cover your concerns IMHO. But I would be glad to hear your opinion.
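
For reference, a toy calculation of the implied entropy claim per time delta for a few oversampling rates, matching the 256 / 1280 = 0.2 figure above:

#include <stdio.h>

int main(void)
{
        unsigned int osr;

        for (osr = 1; osr <= 5; osr++)
                printf("osr %u: %4u time deltas per 256-bit block -> implied >= %.3f bits/delta\n",
                       osr, 256 * osr, 256.0 / (256 * osr));
        return 0;
}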

Ciao Stephan

smuellerDD commented 3 years ago

Am Sonntag, dem 21.03.2021 um 09:45 -0700 schrieb Kurt Roeckx:

This is a version of the test program I was using: jitterentropy-hashtime.c.gz (Based on jitterentropy-testing-20201205).

Thank you for sharing it. So basically you print out 4096 time deltas collected at once.

The attached version doesn't give reproducible results for me on my non-idle desktop, but I get something in the range of 0.15 - 2.4, depending on the run.

Could you please help me understand how you translate the 64-bit values into the data that yields the entropy value? As I outlined in my answer to Joshua, I always had difficulties processing large values and thus applied the truncation.

If I change things like the SIZE, I get different results. For instance when I change it from 4096 to 256 it returns value in the order of 2.7 - 4.9.

That change implies that it may introduce variations from the fwrite operation, as you rightly indicated before, which is what caused the change to the tool where I now allocate sufficient memory on the heap and collect all data before doing an fwrite.

So we should dismiss the higher entropy values from the lower SIZE value.

If I then uncomment the lines that are commented out, it changes to be between 4.5 and 5.7. So my best estimate is that it changed from 0.15 bit of entropy to 4.5 bit of entropy, just by doing something external to the measurement, and that 0.15 is most likely an overestimate.

smuellerDD commented 3 years ago

Am Montag, dem 22.03.2021 um 18:14 -0700 schrieb Joshua E. Hill:

I've been testing using the currently checked in version in the master branch for both the library and the testing (because it includes a bunch of SP 800-90B compliance changes that I care about). I have also taken it on myself to change jitterentropy-hashtime.c to:

  1. Disable the (pseudo-)random round data production (I'm not using the data, and the testing would take much longer with that enabled).

  2. Set ec_min->fips_enabled to 1 (as pointed out in issue #32)

  3. Changed the "initialization" jent_measure_jitter prior to the fixed-round outputs to use a fixed round

  4. Output a single file across all rounds in a binary format (uint32_t values) that I use in other tools

Agreed to all. With the test code currently present in https://github.com/smuellerDD/jitterentropy-library/blob/master/tests/raw-entropy/recording_userspace/jitterentropy-hashtime.c these should not be needed any more though.

I don't think that any of these changes should affect the results, I made them to make large scale testing a bit easier.

Agreed.

I ran tests on a few variations of the __x86_64__ jent_get_nstime implementation. The version checked in uses inline assembly to perform a rdtsc instruction. I made a few variants of this approach:

  1. A version where I use a __rdtsc() compiler intrinsic instead of the inline assembly. (This version is intended to have the same characteristics as the checked in version). This is labeled as the rdtsc variant.

Agreed.

  2. A version where I use a compiler intrinsic to add a cpuid instruction prior to the rdtsc instruction. This is labeled as the cpuid variant.

Agreed - this should do the serialization.

3.  A version where I use a rdtscp compiler intrinsic instead of the rdtsc compiler intrinsic. This is labeled as the rdtscp variant.

Ok.

  4. A version where I apply all the fences noted in the Intel architecture guide for the rdtscp instruction to force instruction ordering / memory commitment before and after the instruction: the order discussed is mfence then rdtscp then lfence. This is labeled as the fenced variant.

Ok.

Each round, I created 1 million delta values. I ran 1000 rounds of each of these variants pinned to a single core, and once for the rdtsc variant, but not pinned to a specific core.

Perfect.

I used clang-11.1.0 to compile all the binaries. As the optimizer seems to have a major impact on the performance of this system, I tested using -O0, -O1, -O2, and -O3.

Great.

I conducted all testing on a fairly quiescent modern bare-metal Intel-based (Cascade lake) system.

Here are histograms for the delta values and the assessments for these various situations. If you want to see this data in some other format, let me know.

This is all awesome data!

Would you please be so kind and make the raw values available?

The only initial challenge I had with the data was that the histograms sometimes have dissimilar sizes in the bars (i.e. sometimes one bar is 20 cycles in width, sometimes 50) making a comparison a bit challenging. But once I considered this, the histograms make perfect sense.

Thanks a lot for it!

smuellerDD commented 3 years ago

Am Montag, dem 22.03.2021 um 22:39 -0700 schrieb Joshua E. Hill:

I'm not sure that it's completely clear what conclusions to draw, honestly. :-) Here are few:

Let us try together :-)

  1. Clearly the optimizations used make a big difference, which is already flagged in Stephan's extensive documentation. That being said, it seems reasonable to change the default Makefiles to make "no optimization" default (using -O0 instead of the current flag -O2).

Yes, the optimizations do indeed make a difference. And to make it more clear what exactly is executed, I should turn this back to -O0 - over time compilers may vary with their optimizations - and I would not like to rest my assessment on how good or bad these optimizations are.

I will change the optimization after I have made further studies on whether I can lower the oversampling rate I suggested in the posts before.

  2. The delta distribution doesn't seem to have an obvious correlation with the assessed entropy. In instances where really low entropy assessments are produced, this seems mainly (perhaps solely?) due to a low entropy assessment in the t-tuple estimator, which is a test that can perform poorly with nearly any given delta histogram. The LRS, lag, MultiMMC, and compression estimators also seem to provide notably low assessments in these cases. Together, these imply that recurring multi-symbol patterns can become a problem in this source. It may be a good idea to implement a health test that watches for such low-entropy conditions. The existing APT / RCT are largely insensitive to this type of defect.

What window size would you consider for such multi-symbol pattern test?

  3. I was expecting that pinning the testing process to a single core would decrease the assessed entropy, but it appears to have only a modest effect on the tested hardware (and the direction of the effect wasn't consistent!).

If you look at the test I outline in section 6.3 of my write-up, I even have only one CPU available.

The tested architecture synchronizes the TSC between all cores on the same processor (and there is only one processor on this particular machine), so it may be that pinning isn't really all that important in that context. I'll re-run the other test variants without pinning and try to see what happens.

  4. I wasn't (and still am not) really sure what the appropriate approach to extracting the TSC is. The "fenced" variant of rdtscp seems the best way to accurately measure the variation associated with the memory task / SHA-3 conditioning, but in at least some circumstances, it seems that the variation due to when the rdtsc is actually run is a substantial component of the total variation. In addition, the necessary fences are only available with SSE2, and rdtscp is only available in Nehalem and later. The cpuid approach would work, but is more disruptive than the "fenced rdtscp" approach. In addition, to get all the benefits of the "fenced rdtscp" approach, you would need to run the cpuid before and after the rdtsc instruction, I think.

To be honest, I am not sure I would like to go into the details of how a time stamp is produced. As long as it is produced and is sufficiently fine-grained, I would prefer not to think about it further. Otherwise it becomes very tedious work to apply this assessment to every CPU architecture.

  5. This may just be my ignorance and persistent failure to really learn GCC's inline assembly syntax and conventions, but I think that using compiler intrinsics is a bit easier to read than the existing inline assembly. This may impose some compiler restrictions (but then, using GCC's inline assembly syntax also does so).

Agreed - what you see is legacy. I started the work long ago and there were no such intrinsics available to my knowledge.

  6. I've done quite a bit of testing of prior versions of this project on other architectures, but not with this new code base yet. My suspicion is that the underlying assessment distribution is going to vary significantly with the underlying hardware details, so I think that it is going to be hard to support a broad hardware-independent entropy lower bound claim. (This certainly seemed to be the case with the prior versions that I have tested across several architectures.)

I agree that the distributions vary. But my goal is that the RNG has an implied heuristic entropy value based on how many time deltas it collects to gather 256 bits of entropy. All that needs to happen is to compare this value with the measurements and to show that the measurements always indicate more entropy than the heuristic claim. That said, I agree that testing would need to be conducted for each CPU type on a given system, but only to the extent that the result is compared with the heuristic implied entropy. And I think this work is not that challenging given the 90B tools we have.

smuellerDD commented 3 years ago

Am Mittwoch, dem 24.03.2021 um 14:39 -0700 schrieb Joshua E. Hill:

Thanks @kroeckx, that was very helpful!

Your comment helped me firm up some of my fuzzy thinking with respect to where to introduce a synchronizing instruction.

Joshua, maybe you want to consider the analysis I provide in the document section 6.1.1.

The first bullet gives you an idea where the worst case is.

smuellerDD commented 3 years ago

Am Donnerstag, dem 25.03.2021 um 15:50 -0700 schrieb Joshua E. Hill:

I ran a few different configurations (mostly) using the approach that I outlined above. For the non-optimized code (-O0) the resulting assessment was in the interval [2.63, 5.75] for all buffer sizes. (the median assessments were all above 3.8).

I then ran through optimized versions of the same process (using '-O2') and ran two runs, one where I allowed the cpuid instruction between "natural samples" to be optimized out, and one where I forced it to be in place.

For the version where I forced the serializing instruction cpuid to run, I found that the lowest observed assessed median entropy was  0.05 bits of min entropy with a buffer with 2^15 different values. I stored 16-bit values in this test, so that corresponds to a 2^16 byte buffer. The observed assessments were all in the interval [0.03, 2.42].

For the version where the cpuid was optimized out, the lowest observed assessed median entropy was 0.08, with the observed assessments being in the interval [0.04, 2.55].

The versions with the cpuid included between "natural samples" seemed to be much more consistent, and those median assessments seemed to mostly decrease as the buffer size increased.

Thanks for the analysis. But may I ask how the analysis with the CPUID helps us for the production case (I see that it is a good study object to gain more insight, though)? The code does not have a CPUID instruction. An attacker can use CPUID in adjacent processes to try to subvert the entropy collection. But based on my considerations outlined in section 8.1.1, I only see one way of interference.

But in any case, your data gives me one more incentive to use -O0 again.

Thanks a lot

joshuaehill commented 3 years ago

May I ask how exactly you created the bit-string for the 90B tool? If you concatenate the 64 bit time values, how did you convince the 90B tool that you have 64 bit symbols

There isn't a really great answer; the 90B estimators are specified as working only up to 256 symbols (8-bit width), so to use the 90B statistical assessment process, one needs to map down the data somehow. I use a few types of translation, all of which I wrote. The tools that I use in each evaluation depend on the source.

If the noise source is expected to produce large scale shifts and low end variation should be ignored (e.g., a fancy quantum source which only produces 16 widely separated states, whose output is read using a 16-bit ADC. The low order bits are expected to be electrical noise, which is not the noise under analysis) then I group together adjacent values until the population of the groups is roughly equal using highbin. This translation approach doesn't appear to be relevant to the jitter entropy library, where the most important bits are actually the low order bits that are effectively discarded in this translation approach.

If the noise source entropy is expected to be distributed on a per-bit basis (i.e., as presumed in SP 800-90B Section 6.4), I can find candidate bit selections using the selectbits tool, which performs a corrected 2016-draft Markov test on each possible selection of bits, up to some specified number and tells you the apparently most successful selections for each number of bits tested. Data with the chosen selection bitmasks can then be extracted using extractbits. This is effectively a generalization of your "truncate the low order nibble" approach, but it automatically avoids keeping fixed bits (e.g., the lsb of a tsc delta appears to always be even for some reason?)

If there are a small number (<=256) of symbols left after the outliers are removed, then the non-outlier data can be translated (using an order-preserving translation) to 8-bit symbols using u32-keep-most-common.

If there are important design-oriented reasons why specific intervals should produce distinct symbols, then I can use u32-manbin.

There are a few other tools to deal with all sorts of odd conditions, but those are the main approaches that I use. They are all included in Theseus, along with my independent 90B test implementations.

knowing that the Markov test only uses 6 or 8 bits as maximum width?

This was the case for the SP 800-90B 2012 and 2016 drafts, but not the final version of SP 800-90B, where the Markov test is now only run on the bitstring version of the input data.

The fundamental restriction here is that as you increase the number of symbols, you need to (at least) linearly increase the number of samples in the data sets. Said differently, the required data sample size increases exponentially with the bit width of the data symbols. The final version of SP 800-90B is also quite clear that they've only specified everything to work up to 8-bit symbols, so the NIST tool only supports that symbol width.

Historically, my own tool (which was developed initially against the 2016 draft of SP 800-90B) allowed for wider data, but I haven't tested that functionality in quite a while, and have little confidence that wider symbols would work at this point.

Would you please be so kind and make the raw values available?

I'd be happy to; I presently have about 52 GB of data on hand (it will be less after I compress, but still an awkwardly large data set). Which data sets do you want, and how would you like me to transfer them to you?

What window size would you consider for such multi-symbol pattern test?

I don't have a specific answer for you here yet. Let me spend a bit of time trying to construct an online health test from the existing SP 800-90B lag estimator and get back to you with a more fully formed answer.

To be honest, I am not sure I would like to go into the details of how a time stamp is produced. As long as it is produced and is sufficiently fine-grained, I would prefer not to think about it further. Otherwise it becomes very tedious work to apply this assessment to every CPU architecture.

I think that this approach is fine in almost all circumstances. When producing a lower bound min entropy estimate for H_submitter, however, it is important not to artificially inflate the estimate by including variation that may not ultimately be present in use. On some older Intel platforms, the TSC was effectively per-core, and there could be substantial differences between the TSC values for each core. In these situations, I think that, when performing testing to support H_submitter, the process should be pinned to a specific core. I think that my testing indicates that this isn't necessary for more modern Intel hardware.

I started the work long ago and there were no such intrinsics available to my knowledge.

As an aside, it would appear that optimizers are somewhat more willing to optimize out intrinsic functions than inline assembly, so there may well be a reasonable reason to leave at least some inline assembly.

Thanks for the analysis. But may I ask what the analysis with the CPUID helps us for the production case?

For the final code used in production, I don't think that it should be present. For jitterentropy-hashtime, I inserted a cpuid serialization instruction between each group of 256 * osr jent_measure_jitter calls, because I wanted to see if preventing out-of-order execution and loads/stores across these "natural samples" changed anything. In the end, it seemed to shift the per-symbol entropy rate down, and in the case where -O2 was used, it seemed to make a significant difference.

The reason to use the with-cpuid variant is that this is a closer approximation of how the library is actually used. In jent_read_entropy, the jent_random_data function groups this many jent_measure_jitter calls together, and in between these groups other code is run.

joshuaehill commented 3 years ago

@smuellerDD: For your reference, here is the jitterentropy-hashtime.c that I was using.

joshuaehill commented 3 years ago

The only initial challenge I had with the data was that the histograms sometimes have dissimilar sizes in the bars (i.e. sometimes one bar is 20 cycles in width, sometimes 50) making a comparison a bit challenging

If you would like, I can force a specific bar size. Mathematica tries to guess what would work best, but sometimes that goes badly.

joshuaehill commented 3 years ago

@smuellerDD, your Jitter Entropy Library paper (Section 6.1.1) contains the following text:

When replacing the serialization function with a pipeline flush using the MFENCE instruction, the execution variations did not decrease compared to simply reading the timing values without any special treatment. Therefore, the pipeline has no effect on the CPU execution timing variations.

I'm not an Intel architecture expert by any means, but Intel's architecture guide describes MFENCE as forcing memory LOAD/STORE instructions to be "globally visible" after MFENCE, and in particular you can get strong ordering of these memory operations with respect to other MFENCE/LFENCE/SFENCE instructions and the broader serialization instructions like CPUID. The MFENCE instruction description explicitly states

MFENCE does not serialize the instruction stream.

This, as compared to the description of CPUID, which states

CPUID can be executed at any privilege level to serialize instruction execution. Serializing instruction execution guarantees that any modifications to flags, registers, and memory for previous instructions are completed before the next instruction is fetched and executed.

My read of the above is that CPUID affects the pipelining of instructions, so CPUID could be used to remove variation due to differences in instruction pipelining. MFENCE would instead remove variation due to differences in LOAD/STORE commitment timing. As such, your test showed that ordering of memory LOAD/STORE commitments (not instruction pipelining) had no effect on CPU execution timing variations.

Based on my testing, instruction pipelining seems to have a substantial effect on the delta distribution for some compiler optimization levels.

smuellerDD commented 3 years ago

Am Freitag, dem 26.03.2021 um 13:12 -0700 schrieb Joshua E. Hill:

If the noise source entropy is expected to be distributed on a per-bit basis (i.e., as presumed in SP 800-90B Section 6.4), I can find candidate bit selections using the selectbits tool, which performs a corrected 2016-draft Markov test on each possible selection of bits, up to some specified number and tells you the apparently most successful selections for each number of bits tested. Data with the chosen selection bitmasks can then be extracted using extractbits . This is effectively a generalization of your "truncate the low order nibble" approach, but it automatically avoids keeping fixed bits (e.g., the lsb of a tsc delta appears to always be even for some reason?)

Thank you for all the test references. Our extraction tool allows specifying a mask that is applied to the 64 bit values. Commonly we use 0xff and 0x0f. But any other value can be used including leaving out some bits on the low end. I used that for analyzing systems where the time stamp does not change the lowest bit (i.e. bit 0 is always a fixed value).

Given your examples, I believe our extraction tool should therefore be appropriate.
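
Concretely, the masking step amounts to something like the following. This is a minimal sketch of the idea rather than the actual extraction tool; mask_and_emit() is a hypothetical helper, and the one-symbol-per-byte output format is what the SP 800-90B assessment tool consumes.

```c
#include <stdint.h>
#include <stdio.h>

/* Keep only the selected low-order bits of each 64-bit raw time delta
 * and write one symbol per byte for the SP 800-90B tool. */
static void mask_and_emit(const uint64_t *raw, size_t n, uint64_t mask, FILE *out)
{
	size_t i;

	for (i = 0; i < n; i++) {
		/* e.g. mask = 0xff (8 LSBs) or 0x0f (low nibble) */
		uint8_t sym = (uint8_t)(raw[i] & mask);

		fwrite(&sym, 1, 1, out);
	}
}
```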

knowing that the Markov test only uses 6 or 8 bits as maximum width?

This was the case for the SP 800-90B 2012 and 2016 drafts, but not for the final version of SP 800-90B, where the Markov test is now only run on the bitstring version of the input data.

The fundamental restriction here is that as you increase the number of possible symbols, you need to (at least) linearly increase the number of samples in the data sets. Since the number of possible symbols grows exponentially with the bit width, the required data sample size also increases exponentially with the bit width of the data symbols. The final version of SP 800-90B is also quite clear that everything is only specified to work up to 8-bit symbols, so the NIST tool only supports that symbol width.

Historically, my own tool (which was developed initially against the 2016 draft of SP 800-90B) allowed for wider data, but I haven't tested that functionality in quite a while, and have little confidence that wider symbols would work at this point.

Given that, I think calculating the data for the 4 and 8 LSBs with our truncation tool is consistent with this limitation.

Would you please be so kind and make the raw values available?

I'd be happy to; I presently have about 52 GB of data on hand (it will be less after I compress it, but still an awkwardly large data set). Which data sets do you want, and how would you like me to transfer them to you?

Maybe you could place it on some file server where I can download it? If that will not work, I could arrange some file space on one of my servers for you to upload the data.

What window size would you consider for such multi-symbol pattern test?

I don't have a specific answer for you here yet. Let me spend a bit of time trying to construct an online health test from the existing SP 800-90B lag estimator and get back to you with a more fully formed answer.

I am also thinking about it and will let you know my results.

Thank you very much Stephan

smuellerDD commented 3 years ago

Am Freitag, dem 26.03.2021 um 13:53 -0700 schrieb Joshua E. Hill:

@smuellerDD: For your reference, here is the jitterentropy-hashtime.c that I was using.

Thanks a lot - I see no issues with it.

smuellerDD commented 3 years ago

Am Freitag, dem 26.03.2021 um 15:00 -0700 schrieb Joshua E. Hill:

The only initial challenge I had with the data was that the histograms sometimes have dissimilar bar widths (i.e., sometimes one bar is 20 cycles wide, sometimes 50), making a comparison a bit challenging.

If you would like, I can force a specific bar size. Mathematica tries to guess what would work best, but sometimes that goes badly.

After getting used to it, I am fine now. I only made the statement because, for some perhaps less math-savvy folks, it could be easier if such graphs could be compared directly.

But please, no extra work on my account. :-)

smuellerDD commented 3 years ago

Am Montag, dem 29.03.2021 um 10:57 -0700 schrieb Joshua E. Hill:

@smuellerDD, your Jitter Entropy Library paper (Section 6.1.1) contains the following text:

When replacing the serialization function with a pipeline flush using the MFENCE instruction, the execution variations did not decrease compared to simply reading the timing values without any special treatment. Therefore, the pipeline has no effect on the CPU execution timing variations.

I'm not an Intel architecture expert by any means, but Intel's architecture guide describes MFENCE as forcing memory LOAD/STORE instructions to be "globally visible" after MFENCE, and in particular you can get strong ordering of these memory operations with respect to other MFENCE/LFENCE/SFENCE instructions and the broader serialization instructions like CPUID. The MFENCE instruction description explicitly states

MFENCE does not serialize the instruction stream.

Compare this to the description of CPUID, which states:

CPUID can be executed at any privilege level to serialize instruction execution. Serializing instruction execution guarantees that any modifications to flags, registers, and memory for previous instructions are completed before the next instruction is fetched and executed.

My read of the above is that CPUID affects the pipelining of instructions, so CPUID could be used to remove variation due to differences in instruction pipelining. MFENCE would instead remove variation due to differences in LOAD/STORE commitment timing. As such, your test showed that ordering of memory LOAD/STORE commitments (not instruction pipelining) had no effect on CPU execution timing variations.

Based on my testing, instruction pipelining seems to have a substantial effect on the delta distribution for some compiler optimization levels.

I am also no Intel CPU instruction expert, but to my knowledge the CPUID instruction is a serializing instruction that flushes all kinds of internal state, including the pipeline. MFENCE, by contrast, enforces strong ordering of memory operations and is ultimately also used as a memory barrier.

Actually, the test I describe was intended to cover the memory-barrier logic and see whether it has an impact on the variations. I agree that the document is not clear here, and I will therefore change it.

Thank you Stephan

smuellerDD commented 3 years ago

@kroeckx @joshuaehill I have pushed the code for disabling the SHA3 execution variation and the use of -O0. Comments are welcome. Thanks.

joshuaehill commented 3 years ago

I was looking more into the LFENCE/SFENCE/MFENCE instructions, and their precise functionality is strikingly unclear. LFENCE (but not SFENCE or MFENCE) also serializes the instruction stream to some degree (in fact, I think the nominal memory effects of the LFENCE instruction are already guaranteed by current Intel processors anyway?). From the Intel Instruction Set Reference, LFENCE:

Performs a serializing operation on all load-from-memory instructions that were issued prior the LFENCE instruction. Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes. In particular, an instruction that loads from memory and that precedes an LFENCE receives data from memory prior to completion of the LFENCE. (An LFENCE that follows an instruction that stores to memory might complete before the data being stored have become globally visible.) Instructions following an LFENCE may be fetched from memory before the LFENCE, but they will not execute (even speculatively) until the LFENCE completes.

There is a lot of contradictory information throughout StackExchange on this topic, but it is clear that the relationship between LFENCE / SFENCE / MFENCE is much more complicated than it would seem from the instruction naming. :-)
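
For what it's worth, the timer-read idiom that falls out of the quoted LFENCE behavior is roughly the following. This is a sketch assuming GCC or clang on x86; it is not code from the Jitter RNG itself.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC/clang) */

/* LFENCE-bracketed timer read: the first LFENCE keeps RDTSC from
 * executing until prior instructions have completed locally; the second
 * keeps later instructions from executing until the timestamp is taken. */
static inline uint64_t rdtsc_lfence_bracketed(void)
{
	uint64_t t;

	_mm_lfence();
	t = __rdtsc();
	_mm_lfence();
	return t;
}
```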

smuellerDD commented 3 years ago

Am Dienstag, dem 30.03.2021 um 09:11 -0700 schrieb Joshua E. Hill:

There is a lot of contradictory information throughout StackExchange on this topic, but it is clear that the relationship between LFENCE / SFENCE / MFENCE is much more complicated than it would seem from the instruction naming. :-)

Good, so I am not the only one feeling a little bit stupid when reading the documentation :-D

joshuaehill commented 3 years ago

I did a set of tests and statistical assessment runs with the new behavior, and everything seems well behaved. I'll follow up with an actual statement of results when I have a bit of time, but in summary everything looks good.

smuellerDD commented 3 years ago

Thank you very much @joshuaehill and @kroeckx .

@kroeckx I hope it is OK with you if I close the issue. I would like to prepare a new release with these changes applied. If you have more issues or comments, please feel free to re-open this ticket or open a new one.