How I Wrote PHP Skeleton For Bison

https://ift.tt/QzPmIKS

Anton Sukhachev

devm.io/php/php-skeleton-bison-generics

Do you dream of generics in PHP?

I wanted them so much that I wrote a library that brings generics to PHP.

<?php

namespace App;

class Box<T> {

    private ?T $data = null;

    public function set(T $data): void {
        $this->data = $data;
    }

    public function get(): ?T {
        return $this->data;
    }
}

If you are interested, you can try it out. It needs only plain PHP, with no extensions.

In this article, though, I want to tell you about a very important part of my library: the AST parser.

I use the very popular library nikic/php-parser, which many other projects also rely on.

It helps you build an AST from source code like this:

<?php

namespace App;

class Test
{
    public function test($foo) {}
}
.
├── ZEND_AST_STMT_LIST
    ├── ZEND_AST_NAMESPACE
    │   └── ZEND_AST_ZVAL 'App'
    └── ZEND_AST_CLASS 'Test'
        └── ZEND_AST_STMT_LIST
            └── ZEND_AST_METHOD 'test'
                └── ZEND_AST_PARAM_LIST
                    └── ZEND_AST_PARAM
                        └── ZEND_AST_ZVAL 'foo'

Every AST parser has a lexical analyzer, a syntax analyzer, and an AST builder. Usually, these are grouped into a Lexer and a Parser.

You don't need to write Lexer and Parser from scratch.

To build a Lexer, you can use tools such as:

  • re2c - the PHP engine uses it to tokenize source code
  • token_get_all() - php-parser uses this function to tokenize source code
  • doctrine lexer - doctrine uses it to parse annotations

How do Lexers work?

A Lexer splits text into tokens.

For example, the PHP engine's Lexer is generated with re2c.

php-src Lexer example

Below you can see PHP code and tokens from Lexer.

<?php    |   T_OPEN_TAG
         |   T_WHITESPACE
$a = 1;  |   T_VARIABLE T_WHITESPACE = T_WHITESPACE T_LNUMBER ;
         |   T_WHITESPACE
echo $a; |   T_ECHO T_WHITESPACE T_VARIABLE ;

We can treat the PHP engine's Lexer and php-parser's Lexer as equivalent, because the function token_get_all() calls the engine's re2c-generated scanner under the hood.
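To see these tokens for yourself, here is a minimal sketch built on PHP's token_get_all() and token_name() functions. The helper name tokenNames() is made up for this illustration and is not part of php-parser or the PHP engine.

```php
<?php
declare(strict_types=1);

/**
 * Returns the token names for a PHP source string.
 * tokenNames() is a hypothetical helper written for this example.
 *
 * @return string[]
 */
function tokenNames(string $source): array
{
    $names = [];
    foreach (token_get_all($source) as $token) {
        // Named tokens arrive as arrays [id, text, line];
        // single characters such as ';' or '=' arrive as plain strings.
        $names[] = is_array($token) ? token_name($token[0]) : $token;
    }

    return $names;
}

// Print the token stream for a small script, one name per token.
echo implode(' ', tokenNames("<?php\n\$a = 1;\necho \$a;")), PHP_EOL;
```

Single-character tokens are printed as themselves, which is why `=` and `;` appear without a T_ name in the table above.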

After the Lexer has run, we have tokens, and we need a Parser to build the AST.

To build a Parser, you can use tools such as:

  • Bison - PHP engine uses it
  • KmYacc - php-parser uses it
  • ANTLR - Twitter search uses ANTLR for query parsing

How do parser generators work?

A generator takes your grammar.y BNF file, parses it, extracts all definitions, and then constructs a bunch of tables like this:

$yytable = [
    6, 3, 7, 20, 8, 51, 28, 1, 52, 4,
    9, 13, 10, 29, 15, 30, 18, 31, 16, 19,
    32, 22, 33, 34, 23, 24, 35, 11, 37, 25,
    21, 38, 39, 26, 45, 0, 40, 42, 0, 43,
    41, 0, 0, 49, 0, 0, 0, 0, 0, 47,
    48, 0, 50, 0, 53, 54
];

Then, this data is passed to a template that is called a Skeleton.

For Bison, a Skeleton is a special file written in the M4 macro language that renders your parser file.

Out of the box, Bison ships Skeletons for C, C++, D, and Java.
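A skeleton mostly wraps the generated tables in a generic driver loop. The sketch below shows that idea on a toy grammar (expr -> expr '+' NUM | NUM). Its tables are written by hand and kept as readable nested arrays, so they are an assumption for illustration only: real Bison output uses packed integer tables like $yytable above, and the names ACTION, GOTO_TABLE, RULES and parseSum() are hypothetical.

```php
<?php
declare(strict_types=1);

// Hand-written LR tables for the toy grammar (not Bison output):
//   rule 1: expr -> expr '+' NUM
//   rule 2: expr -> NUM
const ACTION = [
    0 => ['NUM' => ['shift', 1]],
    1 => ['+' => ['reduce', 2], 'EOF' => ['reduce', 2]],
    2 => ['+' => ['shift', 3], 'EOF' => ['accept', 0]],
    3 => ['NUM' => ['shift', 4]],
    4 => ['+' => ['reduce', 1], 'EOF' => ['reduce', 1]],
];
const GOTO_TABLE = [0 => ['expr' => 2]];
const RULES = [
    1 => ['lhs' => 'expr', 'length' => 3],
    2 => ['lhs' => 'expr', 'length' => 1],
];

/**
 * Table-driven shift/reduce loop. Tokens are [type, value] pairs.
 */
function parseSum(array $tokens): int
{
    $tokens[] = ['EOF', null];
    $states = [0];   // state stack
    $values = [];    // semantic value stack
    $pos = 0;

    while (true) {
        [$type, $value] = $tokens[$pos];
        $state = $states[count($states) - 1];
        $action = ACTION[$state][$type] ?? null;
        if ($action === null) {
            throw new RuntimeException("Unexpected token $type in state $state");
        }

        [$kind, $arg] = $action;
        if ($kind === 'shift') {
            // Consume the token and move to the next state.
            $states[] = $arg;
            $values[] = $value;
            $pos++;
        } elseif ($kind === 'reduce') {
            // Pop the right-hand side and run the rule's semantic action.
            $length = RULES[$arg]['length'];
            $rhs = array_splice($values, -$length);
            array_splice($states, -$length);
            $values[] = $arg === 1 ? $rhs[0] + $rhs[2] : $rhs[0];
            // Look up the state to enter after reducing to the left-hand side.
            $top = $states[count($states) - 1];
            $states[] = GOTO_TABLE[$top][RULES[$arg]['lhs']];
        } else {
            // accept: the single remaining value is the result.
            return $values[0];
        }
    }
}
```

Calling parseSum([['NUM', 1], ['+', null], ['NUM', 2]]) walks the tables through two reductions and returns the sum. In a generated parser, the `$arg === 1 ? ... : ...` line is where the grammar's `{ $$ = ...; }` actions end up.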

The PHP engine and php-parser use different parser generators, but their grammar files are very similar.

php-src grammar example

statement:
    |   T_BREAK optional_expr ';'    { $$ = zend_ast_create(ZEND_AST_BREAK, $2); }
    |   T_CONTINUE optional_expr ';' { $$ = zend_ast_create(ZEND_AST_CONTINUE, $2); }
    |   T_RETURN optional_expr ';'   { $$ = zend_ast_create(ZEND_AST_RETURN, $2); }

php-parser grammar example

non_empty_statement:
    |   T_BREAK optional_expr semi    { $$ = Stmt\Break_[$2]; }
    |   T_CONTINUE optional_expr semi { $$ = Stmt\Continue_[$2]; }
    |   T_RETURN optional_expr semi   { $$ = Stmt\Return_[$2]; }

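To summarize all of this: source code goes through a Lexer that produces tokens, and then through a generated Parser that builds the AST. The PHP engine generates its Parser with Bison, while php-parser generates its Parser with KmYacc.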

I had been thinking about replacing KmYacc with Bison in php-parser.

It would be great if the PHP engine and php-parser used the same tool for the same job.

Even the fact that Bison doesn't have a PHP Skeleton didn't stop me.

I decided to create my own skeleton.

I translated the Java skeleton to PHP. It took me a few months.

Translating Java code to PHP is not very hard, but only when the code is not interleaved with M4 macros and does not have a lot of options.

Java-skeleton example

]b4_yystype[ lval = yylexer.getLVal();]b4_locations_if([[
]b4_location_type[ yyloc = new ]b4_location_type[(yylexer.getStartPos(), yylexer.getEndPos());
status = push_parse(token, lval, yyloc);]], [[
status = push_parse(token, lval);]])[

PHP-skeleton example

/** @@var ]b4_yystype[ */
$lval = $this->yylexer->getLVal();]b4_locations_if([[
/** @@var ]b4_location_type[ */
$yyloc = new ]b4_location_type[($this->yylexer->getStartPos(), $this->yylexer->getEndPos());
$status = $this->push_parse($token, $lval, $yyloc);]], [[
$status = $this->push_parse($token, $lval);]])[

After a few months and many automated tests, the PHP skeleton was ready!

[php-bison-skeleton] composer test
> php vendor/bin/phpunit
PHPUnit 9.6.5 by Sebastian Bergmann and contributors.

................................................................. 65 / 72 ( 90%)
.......                                                           72 / 72 (100%)

Time: 00:04.037, Memory: 6.00 MB

OK (72 tests, 384 assertions)

Then I tried to replace KmYacc with Bison.

You can reproduce the replacement with these steps:

  • install required libraries:

    composer require --dev mrsuh/php-bison-skeleton
    composer require nikic/php-parser
  • generate grammar file of php-parser:

    cd vendor/nikic/php-parser/
    composer install
    php grammar/rebuildParsers.php --keep-tmp-grammar
    cp grammar/tmp_parser.phpy ../../../../../examples/php/nikic-grammar.y
  • replace the dollar signs before Bison generates the Parser and restore them afterwards, because Bison cannot handle PHP's dollar signs in the grammar:

    php bin/replace-dollar-sign.php in nikic-grammar.y nikic-grammar-replaced.y
    bison -S ../../src/php-skel.m4 -o lib/parser-tmp.php nikic-grammar-replaced.y
    php bin/replace-dollar-sign.php out lib/parser-tmp.php lib/parser.php
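The article does not show what bin/replace-dollar-sign.php does internally. One plausible approach, sketched here purely as an assumption, is to swap the dollar sign of PHP variables for a placeholder on the way in and swap it back on the way out, while leaving Bison's own `$$` and `$1` references alone. The placeholder and both function names are hypothetical.

```php
<?php
declare(strict_types=1);

// Hypothetical placeholder; any string that cannot occur in the grammar would do.
const DOLLAR_PLACEHOLDER = '__DOLLAR__';

/**
 * "in" direction: hide PHP variables such as $this from Bison.
 * Only a dollar sign followed by a letter or underscore is replaced,
 * so $$ and $1 keep their Bison meaning.
 */
function hideDollarSigns(string $grammar): string
{
    return preg_replace('/\$(?=[A-Za-z_])/', DOLLAR_PLACEHOLDER, $grammar);
}

/**
 * "out" direction: restore the dollar signs in the generated parser.
 */
function restoreDollarSigns(string $generated): string
{
    return str_replace(DOLLAR_PLACEHOLDER, '$', $generated);
}

$action = '{ $$ = $this->makeNode($1); }';
echo hideDollarSigns($action), PHP_EOL;
```

A grammar that uses Bison's named references (`$name`) would need a stricter pattern, since this sketch would hide those too.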

Great! The parser is ready.

Time to compare the PHP parsers generated with Bison and KmYacc.

I ran tests with three different file sizes and different PHP versions (smaller is better):

As you can see, the parser generated with Bison is slower than the parser generated with KmYacc.

I tried to optimize the generated parser code, but that gave an improvement of at most ~15 percent. Not that much.

In the end, I replaced KmYacc with Bison in php-parser, but it does not work as well as I had imagined.

Now I have a well-working php-skeleton for Bison.

Maybe next time I'll try to replace KmYacc with ANTLR.

You can find php-bison-skeleton, along with many examples and tests, in the project's repository (the Composer package is mrsuh/php-bison-skeleton).

Thank you for your time. Hope you find this article useful.







via mrsuh.com https://mrsuh.com

November 15, 2024 at 06:38PM