kokke / tiny-regex-c

Small portable regex in C
The Unlicense
1.22k stars 174 forks source link

Add Precompiled Regex Code Generator #59

Open marler8997 opened 3 years ago

marler8997 commented 3 years ago

I decided to use this library for a scripting language I'm working on. Here's a link to the directory that uses it if you're interested: https://github.com/stitchlang/stitch/tree/fedf09a6a522e8e48c963075c84aa44cfc74a951/src

One thing I wanted was to "compile" my regular expressions at compile time rather than runtime. Not only is this more performant, but it means I can have as many as I want (see https://github.com/kokke/tiny-regex-c/issues/3) and I don't need to allocate dynamic memory for them.

To solve this, I first noted that the regex_t data-structure is very simple:

typedef struct regex_t
{
  unsigned char  type;   /* CHAR, STAR, etc.                      */
  union
  {
    unsigned char  ch;   /*      the character itself             */
    unsigned char* ccl;  /*  OR  a pointer to characters in class */
  } u;
} regex_t;

I was able to write a function that takes a regex_t object and generates "C initializtion" code to "recreate itself". This function was easy to write becuase re.c already has a function named re_print that does something very similar. Here's what it looks like:

#include <re.h>
static regex_t INLINE_WHITESPACE[] = {
    { .type = 2 }, // BEGIN
    { .type = 8, { .ccl = (unsigned char*)" \t" } }, // CHAR_CLASS
    { .type = 6 }, // PLUS
    { .type = 0 }, // UNUSED
};
static regex_t USER_ID[] = {
    { .type = 2 }, // BEGIN
    { .type = 7, { .ch = '$' } }, // CHAR
    { .type = 8, { .ccl = (unsigned char*)"a-zA-Z0-9_\\." } }, // CHAR_CLASS
    { .type = 6 }, // PLUS
    { .type = 7, { .ch = '$' } }, // CHAR
    { .type = 4 }, // QUESTIONMARK
    { .type = 0 }, // UNUSED
};
static regex_t NEWLINE[] = {
    { .type = 2 }, // BEGIN
    { .type = 7, { .ch = '\n' } }, // CHAR
    { .type = 0 }, // UNUSED
};
static regex_t QUOTED_STRING[] = {
    { .type = 2 }, // BEGIN
    { .type = 7, { .ch = '"' } }, // CHAR
    { .type = 9, { .ccl = (unsigned char*)"\"" } }, // INV_CHAR_CLASS
    { .type = 5 }, // STAR
    { .type = 7, { .ch = '"' } }, // CHAR
    { .type = 0 }, // UNUSED
};
static regex_t COMMENT[] = {
    { .type = 2 }, // BEGIN
    { .type = 7, { .ch = '#' } }, // CHAR
    { .type = 9, { .ccl = (unsigned char*)"\n" } }, // INV_CHAR_CLASS
    { .type = 5 }, // STAR
    { .type = 0 }, // UNUSED
};
static regex_t OPEN_PAREN[] = {
    { .type = 2 }, // BEGIN
    { .type = 7, { .ch = '(' } }, // CHAR
    { .type = 0 }, // UNUSED
};
static regex_t CLOSE_PAREN[] = {
    { .type = 2 }, // BEGIN
    { .type = 7, { .ch = ')' } }, // CHAR
    { .type = 0 }, // UNUSED
};
static regex_t ESCAPE_SEQUENCE[] = {
    { .type = 2 }, // BEGIN
    { .type = 7, { .ch = '@' } }, // CHAR
    { .type = 8, { .ccl = (unsigned char*)"@#$\")(" } }, // CHAR_CLASS
    { .type = 0 }, // UNUSED
};

With the C initialization code, I can compile my regex objects directly into my final binary. Now I don't need to call re_compile at runtime. I can have as many regular expressions as I want and there's no need for dynamic memory.

Note that one big piece to this puzzle was exposing the regex_t definition to the user by moving it to the public header file re.h. Without this, the user would not be able to initalize the objects at compile-time.

For reference here's the function I wrote to generate this initialization code (https://github.com/stitchlang/stitch/blob/fedf09a6a522e8e48c963075c84aa44cfc74a951/src/tokens/compiler.c#L21). As you can see it's quite trivial. If a similar function were added to the library, even in a separate file like cgenerator.c then others would easily be able to do this as well.

kokke commented 3 years ago

Hi @marler8997 and thanks for all your effort (seeing all your other PRs etc.) :+1:

Thanks for the kind words. I also think the code is legible enough to be hackable, which is usually a quasi design goal.

Sorry for not having responded until now. I am super busy at work and privately at the moment and do not have time/energy to read or review all your nice suggestions. I expect it to clear up near months end, but please bear with me until then.

Please do not take it for neglect or disregard. I really appreciate your efforts.

kokke commented 3 years ago

Having (quickly!) read the README, I like the idea of stitch. I'm checking the examples now.

marler8997 commented 3 years ago

Sorry for not having responded until now. I am super busy at work and privately at the moment and do not have time/energy to read or review all your nice suggestions. I expect it to clear up near months end, but please bear with me until then.

Ok no problem thanks for letting me know. I have everything I need for right now with my fork and/or branch but if/when you're ready I'm willing to put in effort into upstreaming any changes, bug fixes or features you'd like to bring in as well. I'll plan on doing that after you find some time to address my current PRs. My hope is my fork/branch are only temporary until we can negotiate which changes can and should be upstreamed here.

marler8997 commented 3 years ago

Having (quickly!) read the README, I like the idea of stitch. I'm checking the examples now.

Thanks! This is probably my 5th attempt to come up with something to replace my BASH scripts that I'm happy with. I think I have most of the overall language roughly laid out, but there's still alot of smaller details to figure out such as if I should add keywords, if I should represent binary operators in the syntax or not, if I should add special syntax for arrays and/or environment variables. But luckily the way I have it now is I can easily make changes and experiment without having to write anything in stone. Hopefully it will be ready for production in the coming months.