ER: Regex-defined custom tokens

numist commented 3 years ago

It occurs to me that a C-based custom token matcher is overkill in most cases. For example:

.token hex '0x0BADCAFE' '0x248c'
.token oct '0644' '0777771'
.token bin '11001011b' '10b'

Could be completely defined in the language specification using regexes:

.token hex /0x[0-9A-Fa-f]+/
.token oct /0[1-7][0-7]*/
.token bin /[01]+b/

numist commented 3 years ago

This turned out to not be too complicated to implement in my toy engineering calculator, though integrating it into the parser generator is another thing entirely:

#include "parser.h"
#include <assert.h>
#include <regex.h>

#ifndef ARRAY_SIZE
#define ARRAY_SIZE(X) (sizeof(X)/sizeof(X[0]))
#endif

struct owl_token match_token(const char *string, void *unused) {
  static struct {
    enum owl_token_type token_type;
    const char *regex_str;
  } regex_tokenizers[] = {
    { .token_type = OWL_TOKEN_HEX, .regex_str = "^0[xX][0-9a-fA-F]+" },
    { .token_type = OWL_TOKEN_HEX, .regex_str = "^[0-9a-fA-F]+[hH]" },
    { .token_type = OWL_TOKEN_OCT, .regex_str = "^0[1-7][0-7]*" },
    { .token_type = OWL_TOKEN_OCT, .regex_str = "^[0-7]+[oO]" },
    { .token_type = OWL_TOKEN_BIN, .regex_str = "^[01]+[bB]" },
    { .token_type = OWL_TOKEN_SCI, .regex_str = "^[0-9]+\\.[0-9]+[eE][-]?[0-9]+" }
  };

  struct owl_token result = owl_token_no_match;
  for (int i = 0; i < ARRAY_SIZE(regex_tokenizers); i++) {
    regex_t re;
    int rc = regcomp(&re, regex_tokenizers[i].regex_str, REG_EXTENDED);
    assert(0 == rc);
    regmatch_t re_matches;
    rc = regexec(&re, string, 1, &re_matches, 0);
    assert(0 == rc || REG_NOMATCH == rc);
    if (0 == rc) {
      assert(re_matches.rm_so == 0);
      if (re_matches.rm_eo > result.length) {
        result = (struct owl_token){
          .length = (unsigned long)re_matches.rm_eo,
          .type = regex_tokenizers[i].token_type
        };
      }
    }
  }
  return result;
}

The regexes can be compiled once and reused (the compiled regex_ts are even documented to be thread-safe!) but doing so in a portable way is a bit of a crapshoot… I guess pthread_once could work, but even compiling them once per owl_tree_create_* would probably net a decent win.

ianh commented 3 years ago

There are two big things I'd want here that POSIX regexes can't provide:

token-level ambiguity detection If there are two rules like [0-9]+ (for integers) and [0-9]+(\.[0-9]*)? (for real numbers) which match the same text (123), the conflict should be reported instead of silently choosing one or the other at parse time.

reverse lexing for ambiguity reporting Ambiguities in the grammar are reported by finding a sequence of tokens that can produce two different parse trees. After finding this ambiguous sequence of tokens, the ambiguity checker generates a string which tokenizes back into that sequence. Each token has to be "reverse lexed" into a string which will turn back into that token. E.g., here the number token becomes the string 1:

$ owl -g 'x = number : a  number : b'
error: this grammar is ambiguous

. 1 

  can be parsed in two different ways: as

. 1   
  x:a 

  or as

. 1   
  x:b

ianh / owl

ER: Regex-defined custom tokens #20