simsong / bulk_extractor

This is the development tree. Production downloads are at:
https://github.com/simsong/bulk_extractor/releases

Explore optimizations for scan_aes #158

Closed: jonstewart closed this 3 years ago

jonstewart commented 3 years ago

The AES key scheduler scanner can provide a lot of value, but it is also often the slowest scanner by a wide margin. The implementation is scalar and derives from Jesse Kornblum's findaes C utility.

The scanner shingles across a byte array, checking at each offset whether the bytes there may form a valid key schedule. Some of this logic may be amenable to SIMD techniques; if so, significant performance gains would be likely.

jonstewart commented 3 years ago

schedule_core() (https://github.com/simsong/bulk_extractor/blob/slg-dev/src/scan_aes.cpp#L214) is called in a loop in each valid_aesKKK_schedule() function, and valid_aes128_schedule(), valid_aes192_schedule(), and valid_aes256_schedule() are called on every byte of the sbuf, in shingled fashion.
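
To make the shingling concrete, here's a rough sketch of the scan structure (simplified, not the actual scan_aes.cpp loop; report() is a hypothetical stand-in for feature recording):

#include <cstdint>
#include <cstddef>

constexpr size_t AES128_KEY_SCHEDULE_SIZE = 176;  // 11 round keys * 16 bytes
constexpr size_t AES192_KEY_SCHEDULE_SIZE = 208;  // 13 round keys * 16 bytes
constexpr size_t AES256_KEY_SCHEDULE_SIZE = 240;  // 15 round keys * 16 bytes

bool valid_aes128_schedule(const uint8_t *in);    // defined in scan_aes.cpp
bool valid_aes192_schedule(const uint8_t *in);
bool valid_aes256_schedule(const uint8_t *in);
void report(size_t offset, int bits);             // hypothetical stand-in

void scan_for_keys(const uint8_t *buf, size_t len) {
    // Slide one byte at a time ("shingling") and test whether the bytes
    // at each offset form a valid expanded key schedule.
    for (size_t p = 0; p + AES128_KEY_SCHEDULE_SIZE <= len; p++) {
        if (valid_aes128_schedule(buf + p)) report(p, 128);
        if (p + AES192_KEY_SCHEDULE_SIZE <= len && valid_aes192_schedule(buf + p)) report(p, 192);
        if (p + AES256_KEY_SCHEDULE_SIZE <= len && valid_aes256_schedule(buf + p)) report(p, 256);
    }
}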

So, we want schedule_core() to be fast. The input is 4 bytes. There are essentially three operations: rotating the 4 bytes to the left by 8 bits, replacing each byte with an sbox/table lookup on itself, and then doing an XOR on the last byte with a value in a separate lookup table.
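
For reference, a minimal sketch of those three operations in the common C formulation of the key-schedule core (not the exact bulk_extractor code; sbox and rcon are the standard AES tables, elided here):

#include <cstdint>

extern const uint8_t sbox[256];   // standard AES S-box (elided)
extern const uint8_t rcon[256];   // standard round-constant table (elided)

inline void schedule_core_sketch(uint8_t in[4], uint8_t i) {
    // 1. rotate the 4-byte array left by one byte: {b0,b1,b2,b3} -> {b1,b2,b3,b0}
    uint8_t t = in[0];
    in[0] = in[1]; in[1] = in[2]; in[2] = in[3]; in[3] = t;

    // 2. replace each byte with its S-box lookup
    for (int a = 0; a < 4; a++) in[a] = sbox[in[a]];

    // 3. XOR the round constant into one byte (in[0] in the usual C
    //    formulation; see schedule_core() for exactly what scan_aes does)
    in[0] ^= rcon[i];
}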

I think the rotate and the sbox lookup operations are independent, i.e., their order could be swapped. The XOR operation on the last byte is not.

The rotate-left function is in scan_aes.h (https://github.com/simsong/bulk_extractor/blob/slg-dev/src/scan_aes.h#L31). It's inlined, but as the comment notes, "it should be recoded as a rol instruction" in x86.

The sbox lookups are harder to optimize. It's a full 256-way mapping. There are clever optimization tricks with vector shuffles if you're mapping byte values to a small number of buckets (no more than 16), and ARM has the TBL instruction. AVX512-VBMI may be able to do a full 256-way mapping, but that's going to be seen only on new servers, and perhaps never on workstations.
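
For the record, here's roughly what a full 256-way lookup would look like with AVX512-VBMI (a sketch assuming the two-permute-plus-blend approach, not tested against this repo):

#include <immintrin.h>
#include <cstdint>

// table: the 256-entry S-box, loaded as four 64-byte vectors.
// Requires AVX512-VBMI (vpermi2b) plus AVX512BW for the mask ops.
__m512i sbox_lookup_64(__m512i idx, const uint8_t table[256]) {
    const __m512i t0 = _mm512_loadu_si512(table);         // entries   0..63
    const __m512i t1 = _mm512_loadu_si512(table + 64);    // entries  64..127
    const __m512i t2 = _mm512_loadu_si512(table + 128);   // entries 128..191
    const __m512i t3 = _mm512_loadu_si512(table + 192);   // entries 192..255

    // permutex2var_epi8 selects among 128 table bytes using idx bits [6:0]
    __m512i lo = _mm512_permutex2var_epi8(t0, idx, t1);   // covers 0..127
    __m512i hi = _mm512_permutex2var_epi8(t2, idx, t3);   // covers 128..255

    // pick lo or hi per lane based on idx bit 7
    __mmask64 high = _mm512_movepi8_mask(idx);
    return _mm512_mask_blend_epi8(high, lo, hi);
}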

A bigger opportunity lies within the valid_aesKKK_schedule() functions: don't consider 32 bits in isolation with separate calls to schedule_core(). For example, while a rol on each 32-bit block is better than the separate assignments in the current code, a vector shuffle operation could do this to 4 or 8 32-bit blocks at once. Vector shuffles are fast operations; the only (and significant!) concern would be the latency involved. Regardless, it's likely that operating in 32-bit blocks with a check after each is not as fast as operating in 128/256-bit chunks where possible, and maybe even computing a few of those chunks before testing whether to exit.
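
To illustrate the shuffle idea, a sketch assuming SSSE3: a single pshufb can rotate four 32-bit words at once when they're treated as byte arrays, as schedule_core() does:

#include <tmmintrin.h>   // SSSE3

// Rotate each of four 4-byte groups left by one byte:
// {b0,b1,b2,b3} -> {b1,b2,b3,b0} in every 32-bit lane.
__m128i rot_word_x4(__m128i words) {
    const __m128i idx = _mm_setr_epi8(1, 2, 3, 0,
                                      5, 6, 7, 4,
                                      9, 10, 11, 8,
                                      13, 14, 15, 12);
    return _mm_shuffle_epi8(words, idx);
}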

simsong commented 3 years ago

As you can see from the comments, I researched the rol. It doesn't actually need to be inlined; there is a compiler optimization that automatically creates the rol instruction if you use a well-known idiom. I didn't bother with this because the existing code works and because there is so much other stuff going on in this function that simply fixing the rol would probably not produce measurable performance improvements.
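
For reference, the idiom in question (as I recall, GCC, Clang, and MSVC all recognize this pattern and emit a single rol on x86; C++20 also offers std::rotl):

#include <cstdint>

// Recognized rotate idiom: compiles to one rol instruction at -O1 and
// above. (-n & 31) keeps both shift counts in range, so there is no
// undefined behavior when n == 0.
inline uint32_t rotl32(uint32_t x, unsigned n) {
    return (x << (n & 31)) | (x >> (-n & 31));
}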

I've thought a lot about this. This is the slowest scanner, but it's not dramatically slower than the big flex scanners. I think that the way to make this scanner better is to limit when it is used. Specifically:

- There's no reason for it to be used on ASCII text. Perhaps this could get input from other scanners, and if another scanner finds data in a 512-byte region, we know not to run this scanner?
- It's looking for a key schedule, so it probably should be limited to larger blocks of memory.
- Perhaps it could be run on a GPU using SIMD instructions, so that it's simultaneously analyzing the buffer at p, p+1, p+2, etc.
- Perhaps we can exploit alignment and only check p, p+2, p+4?
- Perhaps it should be disabled by default?

Now that the BE2.0 architecture is working---two years later than originally expected---we can think about adding the file-based iterator: first mapping out all of the files, processing them, and then processing the non-file space. There is no reason for scan_aes to be run on the content of a Microsoft Word file or a PDF. Even simply tagging which blocks within a file system belong to these files will go a long way toward improving performance. I think that these approaches are more likely to pay off than the micro-optimizations we are discussing above.

jonstewart commented 3 years ago

Mostly these are notes as I work through my understanding of the code, not proposed courses of action just yet.

I agree that macro-optimizations have the most to offer. The approach seems to be to rebuild the key schedule incrementally and verify its validity, block by block. I wonder:

Also, in the vein of considering pragmatics: is it worth checking for 192-bit keys? Who's ever used 192-bit AES?


simsong commented 3 years ago

I have never encountered 192-bit keys. I asked a member of the AES team why 192 was even in the standard, and the answer I got was not satisfactory to me. My understanding is that some people thought that 128 didn't provide a large enough safety margin, and 256 was too big, so they split the difference and offered 192.

If we are checking for 192 bit keys, that would be an easy thing to make optional.

jonstewart commented 3 years ago

Ok, I made issue #231 and will submit a PR momentarily.

simsong commented 3 years ago

Okay, I measured the overhead of my high-bit check and the distinct-count check, and discovered that they cost more than they saved:

out-domexusers-be20v31/report.xml:      <scanner><name>aes</name><ns>558034843919</ns><calls>454</calls></scanner>
out-domexusers-be20v31/report.xml:      <feature_file><name>aes_keys</name><count>2</count></feature_file>
out-domexusers-be20v31_no_hbc/report.xml:      <scanner>aes</scanner>
out-domexusers-be20v31_no_hbc/report.xml:      <scanner><name>aes</name><ns>498946559495</ns><calls>454</calls></scanner>
out-domexusers-be20v31_no_hbc/report.xml:      <feature_file><name>aes_keys</name><count>2</count></feature_file>
out-domexusers-be20v31_no_hbc_no_distinct/report.xml:      <scanner>aes</scanner>
out-domexusers-be20v31_no_hbc_no_distinct/report.xml:      <scanner><name>aes</name><ns>255327033498</ns><calls>454</calls></scanner>
out-domexusers-be20v31_no_hbc_no_distinct/report.xml:      <feature_file><name>aes_keys</name><count>2</count></feature_file>

So I'm taking them out.
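
For context, here's the shape of the prefilters in question (a hypothetical reconstruction; the removed code may have differed in detail). A real key schedule looks random, so the idea was to test high-bit density and distinct-byte count before running the full validation; the catch is that this per-byte work runs at every shingled offset:

#include <cstdint>
#include <cstddef>
#include <bitset>

// Hypothetical prefilter, not the removed scan_aes code. Random-looking
// data should set roughly half its high bits and contain many distinct
// byte values; the thresholds here are illustrative only.
bool plausibly_random(const uint8_t *buf, size_t len) {
    size_t high_bits = 0;
    std::bitset<256> seen;
    for (size_t i = 0; i < len; i++) {
        high_bits += buf[i] >> 7;   // count set high bits
        seen.set(buf[i]);           // track distinct byte values
    }
    return high_bits > len / 4 && high_bits < 3 * len / 4
        && seen.count() > len / 2;
}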

simsong commented 3 years ago

So to summarize the optimizations:

I think that it may be possible to get rid of the first memcpy....

simsong commented 3 years ago

Getting rid of the memcpy adds a conditional and the code ends up running slower:

With the memcpy:

out-domexusers-be20v32/report.xml:      <scanner><name>aes</name><ns>268432665179</ns><calls>454</calls></scanner>
out-domexusers-be20v32/report.xml:      <feature_file><name>aes_keys</name><count>2</count></feature_file>

Without:

out-domexusers-be20v33/report.xml:      <scanner>aes</scanner>
out-domexusers-be20v33/report.xml:      <scanner><name>aes</name><ns>272627556594</ns><calls>454</calls></scanner>
out-domexusers-be20v33/report.xml:      <feature_file><name>aes_keys</name><count>2</count></feature_file>

Here's what was changed:

diff --git a/src/scan_aes.cpp b/src/scan_aes.cpp
index 5bc4a13..30688b0 100644
--- a/src/scan_aes.cpp
+++ b/src/scan_aes.cpp
@@ -311,7 +311,8 @@ bool valid_aes256_schedule(const uint8_t * in)
     uint8_t pos = AES256_KEY_SIZE;
     uint8_t i = 1;

-    memcpy(computed, in, AES256_KEY_SIZE);
+    //memcpy(computed, in, AES256_KEY_SIZE);
+    const uint8_t *computed_src = in;   // while pos < AES256_KEY_SIZE

     // We need 11 sets of sixteen bytes each for 256-bit mode
     while (pos < AES256_KEY_SCHEDULE_SIZE)   {
@@ -331,7 +332,10 @@ bool valid_aes256_schedule(const uint8_t * in)
         }

         for (uint8_t a = 0; a < 4 && pos<AES256_KEY_SCHEDULE_SIZE; a++)     {
-            computed[pos] = computed[pos - AES256_KEY_SIZE] ^ t[a];
+            if (pos == AES256_KEY_SIZE*2 ) {
+                computed_src = computed;
+            }
+            computed[pos] = computed_src[pos - AES256_KEY_SIZE] ^ t[a];

             // If the computed schedule doesn't match our goal,
             // punt immediately!
simsong commented 1 year ago

I'm closing this. We explored the options and came up empty. The main improvement would come from running on a SIMD machine, and I just don't think that it's worth it. We removed AES-192. This can be reopened in the future if there are fundamentally new architectures and if this is a good use of some undergraduate's time.