kokke / tiny-AES-c

Small portable AES128/192/256 in C
The Unlicense

Reduce size even more: generate s-box table? #228

Open Midronome opened 1 month ago

Midronome commented 1 month ago

Hi :)

First, a HUGE thanks for making this and releasing it publicly this way, really awesomely useful :))

I'm implementing this on a tiny uC and I'm trying to reduce (compiled) code size as much as possible. As suggested in the readme, I'm only using CTR.
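(For reference, the README's approach is to compile with the unused modes disabled, e.g. via compiler flags or, something like the following, before including the header; my understanding is that aes.h only defines these macros when they are not already set:)

#define CBC 0   // compile out CBC
#define ECB 0   // compile out ECB
#define CTR 1   // keep only the CTR code paths
#include "aes.h"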

The aes.c file mentions this:

// The lookup-tables are marked const so they can be placed in read-only storage instead of RAM
// The numbers below can be computed dynamically trading ROM for RAM - 
// This can be useful in (embedded) bootloader applications, where ROM is often limited.
static const uint8_t sbox[256] = {

I've been looking around the net for a way to generate these tables (I don't need rsbox since I'm only using CTR), and the best I could find is this: https://crypto.stackexchange.com/questions/98396/how-does-this-code-calculating-aes-s-box-work But I'm not sure how to get the value of p from the polynomial 1 + x + x^3 + x^4 + x^8 in C code?

Thank you!

Midronome commented 1 month ago

Never mind, I found it in the comments: p is the polynomial evaluated at x=2, i.e. p = 1 + 2 + 2^3 + 2^4 + 2^8 = 283.
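(In C, that value can be spelled out directly from the bit positions of the field polynomial x^8 + x^4 + x^3 + x + 1:)

uint32_t p = (1u << 8) | (1u << 4) | (1u << 3) | (1u << 1) | 1u;  /* = 0x11B = 283 */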

So putting sbox in RAM and adding this function reduces the code size by 212 bytes :) (but obviously increases the RAM usage by 256 bytes, the size of the table)

void AES_generateSbox(void) {
    uint32_t p = 283;  // the AES field polynomial x^8 + x^4 + x^3 + x + 1, evaluated at x = 2

    // t[i] = 3^i in GF(2^8); 3 generates the whole multiplicative group
    uint32_t t[256];
    uint32_t x = 1;
    for (int i = 0; i < 256; i++) {
        t[i] = x;
        x ^= (x << 1) ^ ((x >> 7) * p);  // x *= 3, reducing modulo p when bit 7 carries out
    }

    sbox[0] = 0x63;  // 0 has no multiplicative inverse, handled separately
    for (int i = 0; i < 255; i++) {
        x = t[255 - i];  // the inverse of t[i], since 3^i * 3^(255-i) = 3^255 = 1
        x |= x << 8;     // duplicate the byte so the shifts below act as rotations
        x ^= (x >> 4) ^ (x >> 5) ^ (x >> 6) ^ (x >> 7);  // the AES affine transform
        sbox[t[i]] = (x ^ 0x63) & 0xFF;
    }
}
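For this to work, the table declaration in aes.c presumably drops its const qualifier (and the initializer), so it lands in RAM instead of flash:

static uint8_t sbox[256];   /* was: static const uint8_t sbox[256] = { ... }; */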

I checked and the generated sbox is correct.
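(A quick way to check it is a self-test against a few entries of the FIPS-197 table; sbox_selftest is a hypothetical helper, not part of the library:)

#include <assert.h>

void sbox_selftest(void) {
    AES_generateSbox();
    assert(sbox[0x00] == 0x63);
    assert(sbox[0x01] == 0x7C);
    assert(sbox[0x53] == 0xED);
    assert(sbox[0xFF] == 0x16);
}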

If anybody has any ideas on how to reduce the function's code size, I'm happy to hear it :)

Midronome commented 1 month ago

Just FYI, I found a few more minor optimizations when compiling for a 32-bit processor. All the operations are byte-based, while on a 32-bit processor doing them as 32-bit words (when possible) is much more efficient, saving both space and computing time. Fortunately the compiler already optimizes most of the eligible ones, and many of the rest can't be turned into word operations (for example when working on 4 non-adjacent bytes).

Replacing all the uint8_t i with uint32_t i saved a couple of bytes.

Since I'm using only CTR most of the functions benefit from being inlined, so I added a bunch of "inline" on these functions' declaration... Only to find out that the compiler was inlining them already :D

This last optimization, in KeyExpansion(), saved a few more (about 30?):

    // The first round key is the key itself.
    for (i = 0; i < Nk; ++i)
    {
        // optimized version of the commented 4 lines below;
        // assumes roundKey and AES_key are 4-byte aligned
        ((uint32_t*)roundKey)[i] = ((const uint32_t*)AES_key)[i];
        /*roundKey[(i * 4) + 0] = AES_key[(i * 4) + 0];
        roundKey[(i * 4) + 1] = AES_key[(i * 4) + 1];
        roundKey[(i * 4) + 2] = AES_key[(i * 4) + 2];
        roundKey[(i * 4) + 3] = AES_key[(i * 4) + 3];*/
    }

With these, as well as removing the AES_ctx structure (the fixed key is stored in ROM), I got the code size down to 684 bytes (including the key in ROM).
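(A minimal sketch of what that last step could look like, with hypothetical names and assuming AES-128 with the round keys expanded once offline; neither AES_ctx nor KeyExpansion() is then needed at run time:)

/* 11 round keys x 16 bytes = 176 bytes for AES-128, precomputed offline
 * from the fixed key; const keeps them in flash */
static const uint8_t RoundKeysROM[176] = {
    /* ...offline-expanded round key bytes... */ 0
};

/* Cipher() can then be pointed at RoundKeysROM instead of ctx->RoundKey. */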

paulmenzel commented 1 month ago

Nice. It’d be great if you sent a merge/pull request.

algrobman commented 4 weeks ago

There are a few more places where the code could be optimized for speed, and maybe code size, assuming the buffers are DW (doubleword) aligned: the block structure can be represented as an array of 2 uint64_t. On a 32-bit machine the uint64_t operations are emulated with 32-bit instructions, while a 64-bit machine uses its full native width.

1) memcpy - all data movement is limited to 16 bytes (AES_BLOCKLEN):

void aes_memcpy(uint8_t *to, const uint8_t *from, size_t len){
    // always copies exactly 16 bytes; both pointers must be 8-byte aligned
    const uint64_t *from64 = (const uint64_t *)from;
    uint64_t *to64 = (uint64_t *)to;
    (void)len;  // kept for signature compatibility, deliberately unused
    to64[0] = from64[0];
    to64[1] = from64[1];
}

2) XOR with Iv can be recoded as:

static void XorWithIv(uint8_t* buf, const uint8_t* Iv){
    // XOR one 16-byte block as two 64-bit words; assumes 8-byte alignment
    uint64_t *buf_ptr = (uint64_t *) buf;
    const uint64_t *Iv_ptr = (const uint64_t *) Iv;
    buf_ptr[0] ^= Iv_ptr[0];
    buf_ptr[1] ^= Iv_ptr[1];
}

3) Finally (there is no need to worry about the 8-bit counter overflowing, since the byte-wise increment propagates the carry, and again the XOR'ing can be sped up):

void AES_CTR_xcrypt_buffer(struct AES_ctx* ctx, uint8_t* buf, size_t length){

  // declared as uint64_t so the byte view below is guaranteed 8-byte aligned
  uint64_t buffer64[AES_BLOCKLEN / 8];
  uint8_t *buffer = (uint8_t *)buffer64;
  uint64_t *buf64 = (uint64_t *)buf;

  size_t i;
  int bi;

  // note: this variant assumes length is a multiple of AES_BLOCKLEN
  for (i = 0; i < length; i += AES_BLOCKLEN){

      aes_memcpy(buffer, ctx->Iv, AES_BLOCKLEN);
      Cipher((state_t*)buffer, ctx->RoundKey);

      /* Increment Iv and handle overflow */
      for (bi = (AES_BLOCKLEN - 1); bi >= 0; --bi){
        ctx->Iv[bi]++;
        if (ctx->Iv[bi]) { break; }  // no carry out, stop propagating
      }

      buf64[0] ^= buffer64[0];
      buf64[1] ^= buffer64[1];
      buf64 += 2;
  }
}

Probably the 4/8-byte operations could be enforced more elegantly by having a union of byte and DW arrays in the AES_ctx buffers.
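Something like this (a hypothetical sketch of that union; the names are made up):

#include <stdint.h>

typedef union {
    uint8_t  b[16];   /* byte view, e.g. for the per-byte counter increment */
    uint64_t dw[2];   /* doubleword view, for the fast copy and XOR above */
} aes_block_t;        /* also guarantees 8-byte alignment for both views */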