kokke / tiny-regex-c

Small portable regex in C
The Unlicense
1.24k stars 174 forks source link

use flat memory layout #62

Open marler8997 opened 3 years ago

marler8997 commented 3 years ago

This is a proposed change to the storage mechanism of compiled regular expressions. Currently, a compiled regex is made up of an array of these fixed-size structures:

typedef struct regex_t
{
  unsigned char  type;   /* CHAR, STAR, etc.                      */
  union
  {
    unsigned char  ch;   /*      the character itself             */
    unsigned char* ccl;  /*  OR  a pointer to characters in class */
  } u;
} regex_t;

The newly proposed data structure is a sequence of variable-length nodes of this structure:

typedef struct regex_t
{
  unsigned char type;    /* CHAR, STAR, etc.                      */
  unsigned char data_len;
  unsigned char data[0];
} regex_t;

The first difference between the two is that the new structure does not contain any pointers to any other data. All the data for the regex is self-contained which means there's no ownership dependencies to handle.

The next difference is fixed-length vs variable-length. The current fixed-length regex_t structure is 8 bytes on 32-bit systems and 16 bytes on 64-bit systems. This is because even though the type field is a single byte, the second field contains a pointer which needs to be aligned on the system's word boundary (4 for 32-bit 8 for 64-bit). Note that it also wouldn't matter whether the field order were changed, because the size of the structure also needs to align the next element if it were in an array. In comparison, the newly proposed data structure is variable length. It is 2 bytes for most types with CHAR, CHAR_CLASS and INV_CHAR_CLASS taking more bytes to store their raw characters. The new structure requires less memory to store any regex and typically ends up taking 2 to 4 times less memory.

It's also worth noting that the new structure has no alignment requirement, so it can be moved to any location in memory with something like memcpy.

Another difference is the new data structure should be a bit more cache friendly while checking for matches. Along with being smaller, it also turned the original structure of 2 buffers into 1 buffer which guarantees that the type id is right next the data it needs while checking for matches.

Yet another difference is the buffers to store both the regex objects and the character data are "uncolored". This means you don't have to adjust two different sizes of two different buffers to make sure you have enough memory to store a regex, there is only one size to adjust.

I also tested run time using time ./tests/test2 with this change. Before this changes times range from 4 to 5.5 seconds, after this change they ranged from 3.8 to 4 seconds. Times were more consistent and around 20% faster on average, which makes sense considering that it should be more cache friendly.