renesas / fsp

Flexible Software Package (FSP) for Renesas RA MCU Family
https://renesas.github.io/fsp/
Other
182 stars 81 forks source link

NetXDuo SCE9 AES-CBC encryption routine yields poor encryption performance if called with unaligned pointers #289

Closed hwmaier closed 8 months ago

hwmaier commented 10 months ago

The NetXDuo SCE9 AES-CBC encryption routine yields poor encryption performance if called with unaligned pointers. In fact the resulting encryption performance is twice as slow compared to not using SCE9 hardware accelerated encryption. This behaviour means the advantage of using SCE9 hardware accelerated encryption is diminished.

Investigation showed that the problem relates to NetXDuo passing the data in the packet->prepend_ptr to sce_nx_crypto_cbc_encrypt() which then checks for alignment to a 128 bit boundary using this code:

  if (RM_NETX_SECURE_CRYPTO_NONALIGNMENT_CHECK(input) || RM_NETX_SECURE_CRYPTO_NONALIGNMENT_CHECK(output))
  { ...

As a result of this check the encryption for a payload of 1072 bytes is is split into 67 calls executed in a loop, rather than encrypted as one single chunk.

The sce_nx_crypto_cbc_encrypt code behaves correctly and performs well if the if the alignment check is actually removed. The hardware manual also does not stipulate that an alignment would be necessary, only that the total length must be divisible by 128 bit.

As an example 1000 DTLS packets are encrypted in 2600us with SCE9 encryption and alignment check. The same operation takes 1200us when not using SCE9 and only 294us (!) if the alignment check is replaced with aif (false) clause and disabled.

It appears that the alignment check is unnecessary and the FSP mbedTLS code does not do that check either. Only the NetXDuo FSP implementation.

michaelthomasj commented 10 months ago

Hi Henrik, That code is checking 32-bit alignment: #define RM_NETX_SECURE_CRYPTO_NONALIGNMENT_CHECK(x) ((((uint32_t) (x) & 3U) != 0) ? true : false) The reason is that the driver has this declaration HW_SCE_Aes128EncryptDecryptUpdateSub (const uint32_t * InData_Text, uint32_t * OutData_Text, const uint32_t MAX_CNT), so passing in an unaligned value to InData_Text is technically undefined behavior. Since we support a variety of crypto engines with the same API, it is likely that some of them only process data in 32-bit chunks and others dont (in fact there was a bug in the past where code would crash due to trying to write data from an unaligned pointer to the hardware FIFO). But let me do some digging and get clarification on the SCE9 specifically for that operation. -michael

hwmaier commented 10 months ago

Hi @michaelthomasj,

Thank you for checking.

The datasheet does not mention an alignment requirement for the SCE9 AES engine. Only that the length must be 128 bit.

Also the mbedTLS code does not do that alignment check. And the code works without alignment.

So please check if its really a hard requirement for SCE9.

If the alignment is kept in place, SCE9 encryption will always be much slower than not using SCE9. So this doesn't make sense to me. The whole idea of using SCE9 is to be faster not slower.

hwmaier commented 10 months ago

@michaelthomasj My initial work-around was patching sce_nx_crypto_cbc_encrypt to allocate memory on the heap for input and output buffers and then copy InData_Text and OutData_Text. Then I found out that mbedTLS does work without alignment so I tried it out and it worked.

michaelthomasj commented 10 months ago

@hwmaier , You mentioned that if you removed the alignment check then the code completed in 294us with unaligned pointers.

hwmaier commented 10 months ago

I thought I mention that we are not using the encryption routines directly. This is the code we are running on the top level:

UINT rc;
NX_PACKET *packetPtr;

rc = nx_secure_dtls_packet_allocate(&dtlsSession, getPool(), &packetPtr, NX_NO_WAIT);
if (rc == NX_SUCCESS)
{
    rc = nx_packet_data_append(packetPtr, data, 1024, getPool(), NX_NO_WAIT);
    if (rc == NX_SUCCESS)
    {
        rc = nx_secure_dtls_session_send(&dtlsSession, packetPtr, &server_ip, port);
        if (rc != NX_SUCCESS)
        {
            nx_packet_release(packetPtr);
        }
    }
    else
    {
        nx_packet_release(packetPtr);
    }
}

So we are using the DTLS API from NetXDuo and NetXDuo performs the encryption of a 1024 byte payload and sends it over the network.

To answer your questions:

@hwmaier , You mentioned that if you removed the alignment check then the code completed in 294us with unaligned pointers.

The time measurements quoted are not just the encryption, the time measurement would include the network transport as well. I used Wireshark to measure the emission of the DTLS packets onto the network averaged across 1000 packets. But the network transport time is constant and the relative time difference between the different methods what matters.

The alignment is something I have NO influence over. NetXDuo allocates the buffers using nx_secure_dtls_packet_allocate(). But when you look at the implementation of nx_secure_dtls_packet_allocate you will see that the DTLS payload will never sit at an aligned address due to the DTLS record header of 13 bytes (NX_SECURE_DTLS_RECORD_HEADER_SIZE). Refer to line 114 in nx_secure\src\nx_secure_dtls_packet_allocate.c.

hwmaier commented 10 months ago

I extracted our code into a smaller test program. It can be used to connect to the following OpenSSL test server:

openssl s_server -dtls -accept 1337 -nocert -psk deadbeef -psk_identity "matilda" -cipher PSK-AES128-CBC-SHA256

Here is the NetXDuo DTLS client program code which will connect to above OpenSSL server and send 100 encrypted packets with a payload of 1024 bytes. The IP address of of the server server_ip.nxd_ip_address.v4 = IP_ADDRESS(192, 168, 0, 15); needs to be adjusted to suit.

#include <stdalign.h>
#include <stdlib.h>
#include <stdint.h>

#include "tx_api.h"
#include "nx_api.h"
#include "nx_ether0.h"
#include "nx_crypto.h"
#include "nx_secure_x509.h"
#include "nx_secure_dtls_api.h"
#include "nx_secure_dtls.h"

/* Define the ThreadX and NetX object control blocks...  */
TX_THREAD main_thread;
NX_PACKET_POOL pool_0;
NX_IP ip_0;
NX_SECURE_DTLS_SESSION dtlsSession;

extern const NX_SECURE_TLS_CRYPTO  nx_crypto_tls_ciphers;

UCHAR mainStackMemory[TX_DEFAULT_STACK_SIZE] alignas(4);
UCHAR ipStackMemory[TX_DEFAULT_STACK_SIZE] alignas(4);
UCHAR arpCacheMemory[1024] alignas(4);
UCHAR cryptoMetaData[19000] alignas(4); // Cryptography routines and crypto work buffers
UCHAR dtlsPacketBuf[4000] alignas(4); // Packet reassembly buffer for decryption

#define REMOTE_CERT_SIZE (sizeof(NX_SECURE_X509_CERT) + 2000)
#define REMOTE_CERT_CNT 3
UCHAR remoteCertsBuf[REMOTE_CERT_SIZE * REMOTE_CERT_CNT] alignas(4); // Remote certificate buffer for incoming certificates
NX_SECURE_X509_CERT *identityCertPtr;

#define DTLS_IDENTITY "matilda"
#define DTLS_KEY "\xDE\xAD\xBE\xEF" // key up to 64 bytes in length.

#define UDP_PORT 1337

UCHAR testData[1024] = { 0x55 };

/**
 * Main thread
 */
static void main_thread_entry(ULONG thread_input)
{
    UINT rc;
    NX_UDP_SOCKET sock;
    NXD_ADDRESS server_ip;

    printf("Test program started.\n");

    // Wait until link is up
    tx_thread_sleep(3 * TX_TIMER_TICKS_PER_SECOND);

    // UDP socket must be created before creating DTLS session
    rc = nx_udp_socket_create(&ip_0, &sock, "dtls", NX_IP_NORMAL, NX_DONT_FRAGMENT, NX_IP_TIME_TO_LIVE, 8192);
    assert(rc == NX_SUCCESS);

    rc = nx_udp_socket_bind(&sock, UDP_PORT, NX_WAIT_FOREVER);
    assert(rc == NX_SUCCESS);

    /* Create a DTLS session for our socket. Ciphers and metadata defined
    elsewhere. See nx_secure_tls_session_create reference for more
    information. */
    rc = nx_secure_dtls_session_create(&dtlsSession,
                                        &nx_crypto_tls_ciphers,
                                        cryptoMetaData,
                                        sizeof(cryptoMetaData),
                                        dtlsPacketBuf,
                                        sizeof(dtlsPacketBuf),
                                        REMOTE_CERT_CNT,
                                        remoteCertsBuf,
                                        sizeof(remoteCertsBuf));
    assert(rc == NX_SUCCESS);

    // Set the PSK
    rc = nx_secure_dtls_psk_add(&dtlsSession, (UCHAR *)DTLS_KEY, sizeof(DTLS_KEY) - 1, (UCHAR *)DTLS_IDENTITY, sizeof(DTLS_IDENTITY) - 1, NX_NULL, 0);
    assert(rc == NX_SUCCESS);

    // Without this second call, the identity will not be transmitted and is empty. nx_secure_dtls_psk_add does not seem to send the identity.
    rc = nx_secure_tls_client_psk_set(&dtlsSession.nx_secure_dtls_tls_session, (UCHAR *)DTLS_KEY, sizeof(DTLS_KEY) - 1, (UCHAR *)DTLS_IDENTITY, sizeof(DTLS_IDENTITY) - 1, NX_NULL, 0);
    assert(rc == NX_SUCCESS);

    /* Add the certificate to the local store using a numeric ID. */
    nx_secure_dtls_session_trusted_certificate_add(&dtlsSession,
                                                   identityCertPtr, 1);

    /* Set up IP address of remote host. */
    server_ip.nxd_ip_version = NX_IP_VERSION_V4;
    server_ip.nxd_ip_address.v4 = IP_ADDRESS(192, 168, 0, 15);

    /* Now we can start the DTLS session as normal. */
    rc = nx_secure_dtls_client_session_start(&dtlsSession, &sock, &server_ip, UDP_PORT,
                                             NX_IP_PERIODIC_RATE);
    if (rc == NX_SUCCESS)
        printf("DTLS init OK, connected to DTLS server.\n");
    else
        printf("Error connecting to DTLS server!\n");

    printf("Sending encrypted payloads...\n");
    for (int i = 0; i < 100; i++)
    {
        UINT rc;
        NX_PACKET *packetPtr;

        rc = nx_secure_dtls_packet_allocate(&dtlsSession, &pool_0, &packetPtr, NX_NO_WAIT);
        assert(rc == NX_SUCCESS);

        rc = nx_packet_data_append(packetPtr, testData, sizeof(testData), &pool_0, NX_NO_WAIT);
        assert(rc == NX_SUCCESS);

        rc = nx_secure_dtls_session_send(&dtlsSession, packetPtr, &server_ip, UDP_PORT);
        assert(rc == NX_SUCCESS);
    }

    printf("Test program terminated.\n");
}

/**
 * Application start-up
 *
 * @param first_unused_memory First-available unallocated RAM address for allocations
 */
void tx_application_define(void *first_unused_memory)
{
    UINT rc;

    /* Initialize the NetX system.  */
    nx_system_initialize();

    nx_crypto_initialize();

    /* Initialize the NetX Secure TLS/DTLS system.  */
    nx_secure_tls_initialize();

    /* Create a packet pool.  */
    rc = nx_packet_pool_create(&pool_0, "NetX Main Packet Pool", NX_ETH0_POOLPACKET_SIZE, NX_ETH0_POOL_MEMORY, sizeof(NX_ETH0_POOL_MEMORY));
    assert(rc == NX_SUCCESS);

    /* Create an IP instance.  */
    rc = nx_ip_create(&ip_0, "NetX IP Instance 0", DEFAULT_IP_ADDRESS, 0xFFFFFF00UL, &pool_0, NX_ETH0_DRIVER, ipStackMemory, sizeof(ipStackMemory), 1);
    assert(rc == NX_SUCCESS);

    /* Enable ARP and supply ARP cache memory for IP Instance 0.  */
    rc = nx_arp_enable(&ip_0, arpCacheMemory, sizeof(arpCacheMemory));
    assert(rc == NX_SUCCESS);

    /* Enable ICMP */
    rc = nx_icmp_enable(&ip_0);
    assert(rc == NX_SUCCESS);

    /* Enable UDP traffic.  */
    rc = nx_udp_enable(&ip_0);
    assert(rc == NX_SUCCESS);

    /* Enable TCP traffic.  */
    rc = nx_tcp_enable(&ip_0);
    assert(rc == NX_SUCCESS);

    /* Create the main thread.  */
    rc = tx_thread_create(&main_thread, "main", main_thread_entry, 0, mainStackMemory, sizeof(mainStackMemory), 4, 4, TX_NO_TIME_SLICE, TX_AUTO_START);
    assert(rc == NX_SUCCESS);
}

/**
 * Entry point. Launches ThreadX RTOS and won't return.
 */
int main(void)
{
    tx_kernel_enter();
}

The WireShark log can then be used to analyse the time it took for the 100 packets which is in this example 27ms for 100 packets:

2023-08-29_16-49-35

That way it is easy to compare the different code modification to the AES-CBC like 'no SCE9' vs 'SCE9' vs 'SCE9 with alignment enforcement disabled'.

hwmaier commented 10 months ago

Another update: I repeated the tests with plain TLS instead of DTLS. The behaviour is similar to what I reported above for DTLS. The payload data passed to sce_nx_crypto_cbc_encrypt() after calling nx_secure_tls_session_send() is not aligned. In fact I could not trigger the case that sce_nx_crypto_cbc_encrypt() ever is called with aligned data from NetXDuo. So the speed improvements by disabling the alignment check would also benefit TLS.

michaelthomasj commented 10 months ago

Hi Henrik, Thank you for the detailed data. We have fixed this on our working repo and will be in FSP 5.0. Let me check if there is some way to get to you a patch till that is released end of October. Or you can edit the code to have that disabled and continue with your project. Thanks

hwmaier commented 10 months ago

@michaelthomasj Thank you for attending to this. I did only talk about encryption in this post, but the same would issue would apply to the decryption routine. Are you fixing sce_nx_crypto_cbc_decrypt() as well in FSP 5.0?

Regards the patch, I am happy to apply it locally which I already have done, no need to get the patched version now.

michaelthomasj commented 10 months ago

Yes, we've fixed both sce_nx_crypto_cbc_decrypt and sce_nx_crypto_cbc_encrypt.