PHP Parser Generator Implementation

michael-becker commented 4 years ago

Hey folks,

as mentioned in #104, I developed an implementation for generating PHP parsers. Development was a little stalled due to one guy eating an undercooked bat in 2020 but we are now back on track :)

As you suggested in #104, it is a non-recursive implentation based on the javascript parser. I have tested the parser with several larger files with a pretty large grammar (~250 rules) and did not run into any exceptions as of yet. It also uses unicode code points for representing char and charclass entries.

Prerequisites

The PHP implementation uses the following extensions:

Standard PHP Library (SPL)
Internationalization Functions (intl)
Multibyte String (mbstring)

I developed against PHP 7.4.3 for Windows.

PHP calc example

The parser is generated with the command waxeye -g php . grammars/calc.waxeye resulting in the following file (see also example/php/calc/CalcParser.php):

<?php

use parser\config\Automata;
use parser\config\Automaton;
use parser\config\ParserConfig;
use parser\expression\Expression;
use parser\expression\Expressions;
use parser\NonTerminalMode;
use parser\Parser;

class CalcParser extends Parser
{
    public function __construct()
    {
        $automata = new Automata();

        $automata["calc"] = new Automaton("calc", NonTerminalMode::NORMAL, Expression::SeqExpression(Expression::NonTerminalExpression("ws"), Expression::NonTerminalExpression("sum")));
        $automata["sum"] = new Automaton("sum", NonTerminalMode::NORMAL, Expression::SeqExpression(Expression::NonTerminalExpression("prod"), Expression::StarExpression(Expression::SeqExpression(Expression::CharClassExpression(array(43, 45), array(), array()), Expression::NonTerminalExpression("ws"), Expression::NonTerminalExpression("prod")))));
        $automata["prod"] = new Automaton("prod", NonTerminalMode::NORMAL, Expression::SeqExpression(Expression::NonTerminalExpression("unary"), Expression::StarExpression(Expression::SeqExpression(Expression::CharClassExpression(array(42, 47), array(), array()), Expression::NonTerminalExpression("ws"), Expression::NonTerminalExpression("unary")))));
        $automata["unary"] = new Automaton("unary", NonTerminalMode::PRUNING, Expression::AltExpression(Expression::SeqExpression(Expression::CharExpression(45), Expression::NonTerminalExpression("ws"), Expression::NonTerminalExpression("unary")), Expression::SeqExpression(Expression::VoidExpression(Expression::CharExpression(40)), Expression::NonTerminalExpression("ws"), Expression::NonTerminalExpression("sum"), Expression::VoidExpression(Expression::CharExpression(41)), Expression::NonTerminalExpression("ws")), Expression::NonTerminalExpression("num")));
        $automata["num"] = new Automaton("num", NonTerminalMode::NORMAL, Expression::SeqExpression(Expression::PlusExpression(Expression::CharClassExpression(array(), array(48), array(57))), Expression::OptExpression(Expression::SeqExpression(Expression::CharExpression(46), Expression::PlusExpression(Expression::CharClassExpression(array(), array(48), array(57))))), Expression::NonTerminalExpression("ws")));
        $automata["ws"] = new Automaton("ws", NonTerminalMode::VOIDING, Expression::StarExpression(Expression::CharClassExpression(array(13, 32), array(9), array(10))));
        $config = new ParserConfig($automata, "calc");
        parent::__construct($config);
    }
}

The calc resides in example/php/calc/calc.php and can be run using php calc.php.

Other examples

In addition to the calc example, I have developed the following examples under examples/php

json: Running php JSONParserTest.php parses the JSON file 'example.json' using the JSON grammar provided by waxeye.
waxeye: Running php WaxeyeGrammarParserTest.php parses the provided grammars (calc, json, num, regexp, templ) using the waxeye grammar itself.
unicode_char: This directory contains a file UnicodeParserGrammarGenerator.php which is used to generate two grammar files: unicode_char.grammar and unicode_charclass.grammar. These grammars contain all possible unicode characters (as of unicode version 13: 143859 entries). When the respective parser is generated, the file UnicodeParserTest.php is used to test the grammars.

Parser generation

Parser generation is possible using the command waxeye -g php . grammar. The generation process resides in php.rkt.

Feel free to comment the code and give feedback on improvements.

michael-becker commented 4 years ago

@glebm Thanks for your review. I slightly changed the input handling to increase performance and spare us from hassling with specific encodings by converting the input string into an array using mb_str_split($input). In doing so, we have just one conversion call and all the rest of char matching is done using array indices. It also removes struggling with encoding length.

Concerning https://github.com/waxeye-org/waxeye/pull/107#discussion_r417576352, I would postpone this until completing the other tasks, since my IDE automatically inserts the space when I format my code :)

glebm commented 4 years ago

@michael-becker But doesn't means the user would have to call mb_str_split or similar to obtain parts of the input that correspond to parsed non-terminals?

In other languages we use native string indices to avoid that (e.g. UTF-16 indices in Java and JavaScript).

michael-becker commented 4 years ago

@glebm Ah man, it's 2020 and we still struggle with character encodings. What has gone wrong in the last years? ;)

Unfortunately, as I see it, PHP does not offer a convenient charPointAt-function like Java/Javascript do. As a workaround, I would suggest adding the following functions.

codePointAt: to extract the code point of a character in the given string:

function codePointAt(string $input, int $position, int $length = 1): int
{
    if ($length < 1 || $length > 4) {
        throw new RuntimeException("Code points smaller than 1 and larger than 4 bytes are not supported!");
    }

    $ord = IntlChar::ord(substr($input, $position, $length));

    if (null === $ord) {
        return codePointAt($input, $position, $length + 1);
    } else {
        return $ord;
    }
}

code point length: similar to the Javascript function but with adjustments to UTF-8

function codePointLength(int $codePoint): int
{
    if (0x0000 <= $codePoint && $codePoint <= 0x007F) {
        return 1;
    } elseif (0x0080 <= $codePoint && $codePoint < 0x07FF) {
        return 2;
    } elseif (0x0800 <= $codePoint && $codePoint <= 0xFFFF) {
        return 3;
    } else {
        return 4;
    }
}

With these functions, we should be able to use string indices without any more necessary transformations. Does that sound feasible?

glebm commented 4 years ago

There's no need for codePointAt or any decoding at all. If use byte indices throughout, you can just do this:

$matches = mb_strcut($this->input, $position, strlen($expected)) == $expected;

Here, $expected is a string in the same encoding as input representing the waxeye char (UTF-8).

In case of a match, advance by strlen($expected).

In case of non-match advance by the byte size of the code point at $position, e.g.:

// Returns the byte size of the code point starting at $position in $input.
function codePointSizeAt(string $input, int $position): int {
  // Assumes a code point is at most 4 bytes (true for UTF-8):
  return strlen(mb_substr(mb_strcut($this->input, $position, 4), 0, 1));
}

If substrings in PHP are copy-on-write (don't know if they are), the above could be simplified to this without loss of performance:

function codePointSizeAt(string $input, int $position): int {
  return strlen(mb_substr(mb_strcut($this->input, $position), 0, 1));
}

michael-becker commented 4 years ago

Thanks again for your feedback. I have adapted the code according to your changes in the ANY_CHAR, CHAR and CHAR_CLASS expression cases. To allow for different encodings, I parameterised the maximum byte size to be used.

waxeye-org / waxeye