Tokenizer - Githubissues

rvanvelzen / lime

Lime is a LALR(1) parser generator written in PHP. The original source code can be found at http://sourceforge.net/projects/lime-php/, this is just a fork.

59 stars 9 forks source link

Tokenizer #3

Open AlgorithMan-de opened 8 years ago

AlgorithMan-de commented 8 years ago

I wrote a very very cheap lexer/tokenizer generator in BASH to accompany lime (took the manual one from your calc example as reference). Maybe you want to put it in? If you preferred it to be written in php - it should be translatable easily - it mostly consists of echos. The main advantage is that it selects first longest matches, not first matches.

pflex.txt calc.txt

AlgorithMan-de commented 8 years ago

Here, I translated it to php pflex.txt

AlgorithMan-de commented 8 years ago

With the generated tokenizer, the main calc.php becomes as simple as this calc.txt

gene-sis commented 8 years ago

I don't understand the advantage of selecting the longest match. My own tokenize function evaluates the regular expressions in a logical order without the chance of getting the wrong regex because of different lengths. It is also possible to split functions calls into smaller tokens. E.g. I implemented the syntax IFERROR( exp ; exp ) in the grammar like this: IF ERROR '(' exp ';' exp ')' This way you run only as many regexes as necessary because the first hit is always the right one.

Are there ambiguous regex combinations which would need the longest match? (even with lookaround?)

hikari-no-yume commented 8 years ago

Specified order would seem to be the more useful approach to me. In PHP, for example, reserved words (if, class) would otherwise be interpreted as function or constant names if the lexer didn't respect the order, I think. The regexes for both would match the same code.

AlgorithMan-de commented 8 years ago

In which "logical order" do you test for variable names and keywords? Variable names first will cause "class", "if", "while" etc. to be counted as variable names. Keywords first causes variable names that happen to start with something that is also a keyword to be counted as the keyword, followed by a variable name. So then you'd have to bake parts of the syntax into the regexps (like "if" is only a keyword if it is followed by a space... or a tab or linefeed or parenthesis... you'll never see the end of these cases.)

hikari-no-yume commented 8 years ago

The issue of “Keywords first causes variable names that happen to start with something that is also a keyword to be counted as the keyword” can be avoided by only matching on word boundaries with \b.

But, yes, come to think of it, simply matching the longest token would be simpler.

hikari-no-yume commented 8 years ago

You'd still need to handle the case of keywords, though. if would be matched as both an identifier and a keyword. Would the latter take priority if it was specified earlier?

AlgorithMan-de commented 8 years ago

Yes, because it chooses the FIRST longest match - because a regexp is chosen if its token is > than the previous one - not >= . That means it selects the first regexp that has a longest match - later regexps can only be chosen if they give you a longer match. So you only need to put keywords before identifiers. That is the commonly used approach at writing tokenizers (see lex, flex, jflex, etc)

hikari-no-yume commented 8 years ago

Alright, that seems reasonable.

gene-sis commented 8 years ago

keywords ^((class|if|while)(?![_a-zA-Z])|(else(?=[^_a-zA-Z]|if[^_a-zA-Z]))) comment till the end of line ^(\/\/[\S \r\t\f ]*) inline comment ^(\/\*[\S\s]*?\*\/) variable identifier, operators, parentheses ^([$\.;()+\-\/*=<>!:;]) item name (variable name, constant name, function name) ^([_a-zA-Z]*)

I use a grammar with numbers, {variable name}, functions( exp [; ... ] ) and mathematical operations so I won't need the complexity of the PHP language. I suppose, that there are better and more effective ways of lexing a complex language like PHP.

rvanvelzen commented 8 years ago

Selecting the first longest match is the most useful way to go. It precludes a whole bunch of hard-to-debug issues.

I'll look into integrating something easy like this. Thanks!

AlgorithMan-de commented 8 years ago

combining the two, I have already made a relatively strong parser for a programming language :-)