occs-compilers-2014-spring / discussion

A place for discussion

Implement scanners with regular expressions? #2

Open ikehz opened 10 years ago

ikehz commented 10 years ago

I talked to Bob about this today; I'm curious what others think.

Couldn't we implement a scanner using a simple greedy algorithm matching a list of regular expressions? I'm imagining something like this:

REGEXS = { '+' => T_PLUS, '[a-z][A-Za-z0-9_]*' => T_ID, ... } // map of regexes for different kinds of tokens

def get_next_token()
  s = ''
  matches = 0
  while (matches != 1) do
    s += get next character
    matches = the number of regexes in REGEXS that s matches
    // take care of reaching EOF and other obvious things like whitespace
  end
  return new Token(s, REGEXS[s], line number)
end

Does that make sense to anyone else?

dan-f commented 10 years ago

I like the idea behind this, though I'm worried that it might end up matching tokens prematurely due to its greedy nature. Remember that with the DFA approach we keep on building a token until we read a character which kicks us out of the current state. Consider the token 'void'. With this approach, I think we'll greedily match the letter 'v' as T_ID before we actually get 'o', 'i', and 'd'.

ikehz commented 10 years ago

As we just talked about ...

Good call. We'd have to revise it to take the largest string possible by consuming characters until there's no match:

REGEXS = { '+' => T_PLUS, '[a-z][A-Za-z0-9_]*' => T_ID, ... } // map of regexes for different kinds of tokens

def get_next_token()
  s = ''
  matches = 1 // prime the loop so it runs at least once
  while (matches != 0) do
    s += get next character
    matches = the number of regexes in REGEXS that s matches
    // take care of reaching EOF and other obvious things like whitespace
  end
  // backtrack one character: push it back on the input and drop it from s
  return new Token(s, REGEXS[s], line number)
end
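
In rough Python terms (my own sketch of the same idea, with EOF and whitespace handling still elided; the backtrack is folded into the loop condition), that would be something like:

import re

REGEXES = {r'\+': 'T_PLUS', r'[a-z][A-Za-z0-9_]*': 'T_ID'}  # pattern -> token type (partial)

def get_next_token(text, pos):
    end = pos
    # keep consuming while at least one regex still matches the whole lexeme
    while end < len(text) and any(re.fullmatch(p, text[pos:end + 1]) for p in REGEXES):
        end += 1
    lexeme = text[pos:end]  # stopping here is the "backtrack one character" step
    for pattern, kind in REGEXES.items():
        if re.fullmatch(pattern, lexeme):
            return kind, lexeme, end
    raise SyntaxError('no token at position {}'.format(pos))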

Dan also brought up the question of comments & string literals. Since we're not (?) doing nested comments, comments are regular; string literals might be a little harder with escapes & stuff, but probably not a big pain.
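
For example, assuming non-nested /* ... */ comments and backslash escapes inside double-quoted strings (my guesses at the rules, not checked against the spec), both stay regular:

import re

# guessed patterns: block comments without nesting, strings with backslash escapes
COMMENT = re.compile(r'/\*[^*]*\*+(?:[^/*][^*]*\*+)*/')
STRING  = re.compile(r'"(?:\\.|[^"\\])*"')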

dbarella commented 10 years ago

I actually implemented my scanner first as a DFA and then redid it using only the Python regex engine. The regex engine makes the whole thing really beautiful, because you literally specify the regex for each token type (with some voodoo using named groups to include the token type with the match) and then handle the matches. There was a little trouble with rejecting strings like a <== b because of the greedy match, but I ended up getting around that by checking for the presence of another operator in the next match position (so it will match <= as an operator, do a lookahead to find =, and then reject the string as a syntax error).

However, I was thinking about this and I realized that there's a problem with pointers, which use & and * but in a context-sensitive way. My scanner will erroneously reject the string *a + *b, and I'm still thinking about how to handle this. I could just classify * and & as special operators, or I could allow all operators to follow one another and then reject the syntax in the parser. Does it make more sense to do this check in the parser, where we'll run into an improper AST quickly? My thought is that because this is a context-dependent rule, the tokenizer should not try to handle it. But other things like // aren't allowed either, so I should classify * and & as special ops and disregard them in the operator lookahead.

p.s. A special consequence of my approach is that int 12/*this is a huge int!!!!∂∂∂/fdsfÄ*/34 is tokenized as {T_INT : int, T_NUM: 1234}

dan-f commented 10 years ago

@dbarella I have a few points for you regarding the edge cases you brought up. For context, I implemented my scanner using the DFA algorithm described in class, in Python. You mention that you're handling the syntax error a <== b. Using the DFA algorithm, my scanner parses this as the token a, followed by <=, =, and finally b. Did you ask Bob about this? It seems like this is a syntax error that the parser should be able to catch, since it shouldn't match any defined grammar rules. Not to say that it's wrong for your scanner to try to catch this error, but I wouldn't worry about it.

As far as tokenizing pointers: I'm pretty sure that a * is just a 'star', and & is just an 'ampersand'. As you've pointed out, the function of these symbols depends on their context, so I don't think their token type needs to represent their semantic meaning. Again, I think this is the type of thing the parser should be able to figure out based on grammar rules. I'm not sure if this was the question you were getting at, though.

dbarella commented 10 years ago

I think what your tokenizer finds makes sense. There isn't any way for a DFA to reject these kinds of strings, at least not simply. I figured that having the tokenizer set up this way (with regexes) makes it easy enough to reject illegal sequences like === and +-, which can't be allowed. It might be a breach of encapsulation in the stricter sense, meaning that the tokenizer should theoretically not handle this and it should be up to the parser to realize that T_PLUS can't take T_MINUS as an argument, but I dunno. I have two sets of punctuation regexes: one for braces and parens and all that nonsense, which are allowed to follow each other, and another for operators like slash, plus, minus, and mod, which aren't allowed to follow one another; I just added the * and & operators to the non-lookahead group. I haven't asked Bob about it, though it wouldn't be hard to undo this change – just remove the operator lookahead and it's a pure DFA again.

EDIT: I'm beginning to suspect that I should be very careful about preventing certain character sequences. It's probably for the best to allow most sequences and have the parser differentiate. For BPL it does make some sense to do this in the tokenizer, but for larger languages I don't think I'd attempt it; I'd just let the parser reject them.

ikehz commented 10 years ago

+1 to @daf-'s solution. I think the parser—not the scanner—should take care of cases like <==.

Regarding T_PLUS & T_MINUS, I was wondering about situations like 2+-1 and 2--1 as well. In Ruby, 2+-1 == 1 and 2--1 == 3, since - (and apparently +) is also a unary operator.

dan-f commented 10 years ago

Sorry everyone for the close/open. I just meant to discard a drafted comment.

peter-fogg commented 10 years ago

I think the best way is probably to handle most errors in the parser. My scanner only ever throws an error if it finds an unterminated string or an illegal character like @.

ikehz commented 10 years ago

... if it finds an unterminated string or comment or an illegal character like @.

dan-f commented 10 years ago

Yeah, same.

ikehz commented 10 years ago

For the sake of clarity, I'm opening a new issue for further discussion of what a scanner should and should not accept: #4.

ikehz commented 10 years ago

So, getting back to the original question: after building the DFA-style scanner, I think I figured out what would make a regex-driven scanner so ugly.

Using the algorithm I presented above, we assume that a string that doesn't match any regex cannot possibly match one by adding more characters. This is similar to the problem @daf- originally pointed out, but different. The only token in BPL I can think of that would cause this problem is !=.

If we use just the algorithm above, we get no match on the string !, even though adding the next character, =, would give us a valid token. So we actually need two sets of regexes? One to answer the question, "Is there potential for this string to be a token?" and another to answer the question, "Is this string a valid token?"
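
Something like this, building on the sketch above (the token names and patterns here are just illustrations, not the full BPL set): PREFIX_RES answers the first question, TOKEN_RES the second.

import re

TOKEN_RES  = {'T_NEQ': r'!=', 'T_LE': r'<=', 'T_LT': r'<', 'T_ID': r'[a-z][A-Za-z0-9_]*'}
PREFIX_RES = {'T_NEQ': r'!=?', 'T_LE': r'<=?', 'T_LT': r'<', 'T_ID': r'[a-z][A-Za-z0-9_]*'}

def get_next_token(text, pos):
    end, best = pos, None
    # keep consuming while the lexeme could still become some token ...
    while end < len(text) and any(re.fullmatch(p, text[pos:end + 1]) for p in PREFIX_RES.values()):
        end += 1
        # ... and remember the longest prefix that already is a valid token
        for kind, p in TOKEN_RES.items():
            if re.fullmatch(p, text[pos:end]):
                best = (kind, text[pos:end], end)
    if best is None:
        raise SyntaxError('no token at position {}'.format(pos))
    return best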

@dbarella, how did you get around/solve this issue?

dbarella commented 10 years ago

So, the heart of my scanner is based on a set of token regex definitions:

import re

tok_spec = [
    ('T_ID',            r'[A-Za-z]\w*'),               #Identifiers
    ('T_NUM',           r'\d+'),                       #Numbers
    ('T_STRING',        r'"[^"]*"'),                   #Strings
    ('T_SYM',           r'[\*;,\[\]\{\}\(\)]'),        #Symbols, allowed to follow each other
    ('T_OP',            r'[&\/\+\-%]|<=?|>=?|!=|==?'), #Operators, not allowed to follow each other
    ('IGNORE',          r'\s|\n'),                     #Ignore spaces and newlines
]
tok_regex = '|'.join('(?P<{0}>{1})'.format(*pair) for pair in tok_spec) #Named regex pairs
get_tok = re.compile(tok_regex).match #Regex match function

Using get_tok (just a reference to the regex match function) I step across each line, calling the match function starting at the beginning of the line. If get_tok(line) is None (or matches the IGNORE regex), then there's no match and I ignore it. If it does match, then Python's re module will return an SRE_Match object which contains references to the groups defined in that voodoo-y (?P<{0}>{1}) statement. The idea for a named regex pair isn't mine, but it seems popular among Python people and this is the first chance I had to actually apply it – the syntax seems kinda wonky to me, but it works as stated, so, meh.
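
In simplified sketch form (not the exact code), the stepping loop is something like:

def tokenize_line(line, line_no):
    pos, tokens = 0, []
    while pos < len(line):
        match = get_tok(line, pos)      # Pattern.match accepts a starting position
        if match is None:               # nothing in tok_spec matches here
            raise SyntaxError('Bad token on line {} at column {}'.format(line_no, pos))
        kind = match.lastgroup          # name of the group that matched
        if kind != 'IGNORE':
            tokens.append((kind, match.group(kind), line_no))
        pos = match.end()               # jump ahead by the length of the token
    return tokens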

One very nifty bit of the match function is that it will tell me the location and length of the matching token in the line, so if it's an invalid token I can point it out in an error message:

Bad Token [Dangling comment] on line 17:
>>> x = x + 1; */
                ^

and I can also avoid explicit quadratic time by jumping ahead by the length of the token (so I don't call the regex on each character, assuming the regex does a nearly-linear scan, which I only assume because I don't use backreferences in my token definitions). After this step I can pull the token type and value out of the match object and do_stuff accordingly. The IGNORE field turned out to be very important, because otherwise it wouldn't be easy to recognize illegal characters – this way anything that doesn't fall into the set of regex definitions will trip an exception.

Also notice that this regex doesn't handle comments at all. This is because I decided to hold on to Python's for line in iterable_object: syntax, which means that multi-line comments will fail to be recognized by a line-matching regex without some extra work. There are many ways around this; after a fair amount of thinking I decided the best way (though it sounds lazier) would be to run a preprocessing step on the source to remove all comments and yell at the programmer about dangling comment delimiters. That's a separate method, which technically doubles the running time of the scanner (two O(n) passes instead of one, so still linear), but I think it's more readable than working the logic into the regular tokenizing loop. I also realized just now that this will stop the scanner before any tokens are emitted, which seems like a slight plus to me. Nothing big, because it will either crash here or in the parser, but failing earlier in the game seems good to me.
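
Something along these lines (a simplified sketch of the idea, not the exact method; it ignores the corner case of comment delimiters inside string literals):

import re

COMMENT = re.compile(r'/\*.*?\*/', re.DOTALL)  # non-greedy, spans newlines, no nesting

def strip_comments(source):
    stripped = COMMENT.sub('', source)
    # any delimiter left over means a dangling /* or */ somewhere in the source
    if '/*' in stripped or '*/' in stripped:
        raise SyntaxError('dangling comment delimiter')
    return stripped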

So that's what I have so far, and I'm happy to elaborate on any of the code above or elsewhere in the program. Also, with regard to what @ihmccreery said about matching !=, that didn't give me any trouble – what did cause a little annoyance was accepting = and == while rejecting ===, the philosophy of which we're discussing separately in #4. In the case of my scanner, any token can be allowed to repeat by putting it in the T_SYM category, while anything in the T_OP category is restricted from appearing twice in a row.