Open michael-becker opened 4 years ago
@glebm Thanks for your review. I slightly changed the input handling to increase performance and spare us from hassling with specific encodings by converting the input string into an array using mb_str_split($input)
. In doing so, we have just one conversion call and all the rest of char matching is done using array indices. It also removes struggling with encoding length.
Concerning https://github.com/waxeye-org/waxeye/pull/107#discussion_r417576352, I would postpone this until completing the other tasks, since my IDE automatically inserts the space when I format my code :)
@michael-becker But doesn't means the user would have to call mb_str_split
or similar to obtain parts of the input that correspond to parsed non-terminals?
In other languages we use native string indices to avoid that (e.g. UTF-16 indices in Java and JavaScript).
@glebm Ah man, it's 2020 and we still struggle with character encodings. What has gone wrong in the last years? ;)
Unfortunately, as I see it, PHP does not offer a convenient charPointAt-function like Java/Javascript do. As a workaround, I would suggest adding the following functions.
codePointAt: to extract the code point of a character in the given string:
function codePointAt(string $input, int $position, int $length = 1): int
{
if ($length < 1 || $length > 4) {
throw new RuntimeException("Code points smaller than 1 and larger than 4 bytes are not supported!");
}
$ord = IntlChar::ord(substr($input, $position, $length));
if (null === $ord) {
return codePointAt($input, $position, $length + 1);
} else {
return $ord;
}
}
code point length: similar to the Javascript function but with adjustments to UTF-8
function codePointLength(int $codePoint): int
{
if (0x0000 <= $codePoint && $codePoint <= 0x007F) {
return 1;
} elseif (0x0080 <= $codePoint && $codePoint < 0x07FF) {
return 2;
} elseif (0x0800 <= $codePoint && $codePoint <= 0xFFFF) {
return 3;
} else {
return 4;
}
}
With these functions, we should be able to use string indices without any more necessary transformations. Does that sound feasible?
There's no need for codePointAt
or any decoding at all. If use byte indices throughout, you can just do this:
$matches = mb_strcut($this->input, $position, strlen($expected)) == $expected;
Here, $expected
is a string in the same encoding as input
representing the waxeye char (UTF-8).
In case of a match, advance by strlen($expected)
.
In case of non-match advance by the byte size of the code point at $position
, e.g.:
// Returns the byte size of the code point starting at $position in $input.
function codePointSizeAt(string $input, int $position): int {
// Assumes a code point is at most 4 bytes (true for UTF-8):
return strlen(mb_substr(mb_strcut($this->input, $position, 4), 0, 1));
}
If substrings in PHP are copy-on-write (don't know if they are), the above could be simplified to this without loss of performance:
function codePointSizeAt(string $input, int $position): int {
return strlen(mb_substr(mb_strcut($this->input, $position), 0, 1));
}
Thanks again for your feedback. I have adapted the code according to your changes in the ANY_CHAR, CHAR and CHAR_CLASS expression cases. To allow for different encodings, I parameterised the maximum byte size to be used.
Hey folks,
as mentioned in #104, I developed an implementation for generating PHP parsers. Development was a little stalled due to one guy eating an undercooked bat in 2020 but we are now back on track :)
As you suggested in #104, it is a non-recursive implentation based on the javascript parser. I have tested the parser with several larger files with a pretty large grammar (~250 rules) and did not run into any exceptions as of yet. It also uses unicode code points for representing char and charclass entries.
Prerequisites
The PHP implementation uses the following extensions:
I developed against PHP 7.4.3 for Windows.
PHP calc example
The parser is generated with the command
waxeye -g php . grammars/calc.waxeye
resulting in the following file (see also example/php/calc/CalcParser.php):The calc resides in example/php/calc/calc.php and can be run using
php calc.php
.Other examples
In addition to the calc example, I have developed the following examples under examples/php
php JSONParserTest.php
parses the JSON file 'example.json' using the JSON grammar provided by waxeye.php WaxeyeGrammarParserTest.php
parses the provided grammars (calc, json, num, regexp, templ) using the waxeye grammar itself.UnicodeParserGrammarGenerator.php
which is used to generate two grammar files:unicode_char.grammar
andunicode_charclass.grammar
. These grammars contain all possible unicode characters (as of unicode version 13: 143859 entries). When the respective parser is generated, the fileUnicodeParserTest.php
is used to test the grammars.Parser generation
Parser generation is possible using the command
waxeye -g php . grammar
. The generation process resides inphp.rkt
.Feel free to comment the code and give feedback on improvements.