greenlion / PHP-SQL-Parser

A pure PHP SQL (non validating) parser w/ focus on MySQL dialect of SQL
BSD 3-Clause "New" or "Revised" License
603 stars 156 forks source link

Parsing is too slow #85

Closed witchi closed 9 years ago

witchi commented 9 years ago

From rob...@gmail.com on May 09, 2012 15:59:55

Not a bug but I've found the parser is a bit too slow for my needs. I've attached some pretty minor changes that (on my machine at least) make it over 10x faster... hope you find it useful - thanks for a great project!

Attachment: php-sql-parser.diff php-sql-parser.php

Original issue: http://code.google.com/p/php-sql-parser/issues/detail?id=59

witchi commented 9 years ago

From pho...@gmx.de on May 09, 2012 13:20:26

Heh, nice work. I'm not an experienced PHP developer, so every optimizing is welcome!

Status: Accepted
Owner: pho...@gmx.de
Labels: -Type-Defect Type-Enhancement

witchi commented 9 years ago

From pho...@gmx.de on May 09, 2012 23:30:57

I will include some of your optimizations into the codebase, but only such, which don't obscure the algorithms too much.

witchi commented 9 years ago

From kor...@gmail.com on May 10, 2012 10:59:54

I've tried these changes and on my machine parsing the last query (most complex) in example.php file is ~5 times faster. Great performance. However with this patch, some tests fail.

Parser is very good, but didn't have much time to look at the source and check parsing algorithm. I have custom built ORM, but parsing strategy is on OOP base - not sql string parser. Sql API is in OOP. So, I'm thinking about this parser as replacement, but maybe I need to port to Postgres, or maybe reorganize some pieces in order to integrate it. Not sure yet.

I have one question. I would much appreciate if somebody from the team can give me any feedback. Do you have some internet resources about parsing algorithm (bookmarked, etc...) or any kind of info which can help in upgrading sql parser or for better understanding of its design ?

Thanks in advance.

witchi commented 9 years ago

From pho...@gmx.de on May 11, 2012 00:25:22

At the moment I use http://dev.mysql.com/doc/refman/5.1/en/ to have an idea, which sql statements are possible. You can implement a validating or a non-validating parser. The first one checks the syntax and stops on an error, the latter tries to recognize some keywords and uses "probabilities" to go on.

For the first one you need a grammar, so you can decide, which is the next allowed keyword/character. The other parser looks forward to the next token and implements a type of state machine, to find out the meaning of the token.

An example: A CREATE TABLE statement can have a KEY definition, where after the KEY word follows an index name and/or an index type. The parser looks for "KEY" and sets a state point. So it knows on the next token, that it can be a name or a type. But it can also be a index column. But a column follows always after a "(" character. So you have to look for "(" first to make this decision. In all other cases you have to compare the token with valid index types (BTREE, HASH) to make a difference between type and name. If the statement contains keywords i.e. column names, the PHPSQLParser has no chance to set another meaning for the keyword token (see issue #34 ). If you use BTREE as index name too, the parser will go wrong (the keyword would be interpreted as index type and name would be empty). For the non-validating parser there is a "probability", that BTREE is the type and not the name.

A validating parser would not split the statement into large tokens (which can be error-prone). It would go through the string character by character and decide by the grammar, which is a valid character. There are parser, which can use larger tokens than a single character, but this depends on the grammar (which is the smallest token size, which is useable to make all decision). In SQL you can only use size=1, because you have parts like operators (+, -, *) which have its own meaning.

An example: The PHPSQLLexer splits on ', which splits also string literals. The first task for the lexer is to balance the ' to get a complete string literal as one token (if I wouldn't do it, the parser could find keywords within the splitted string literal).

But the parser doesn't check, whether or not a string literal is allowed on this point within the sql statement. If there should be a table name, the PHPSQLParser will interpret the string literal token as table name. A validating parser would say "oh no my friend, on this position I allow only letters, numbers or underscores (which can be part of a table name) and not '" Full stop. http://en.wikipedia.org/wiki/LL_parser http://en.wikipedia.org/wiki/Finite_state_automata Hope, it helps Andre

witchi commented 9 years ago

From rob...@gmail.com on May 11, 2012 00:47:28

Looks like you've changed the structure a lot since I wrote the patch. Here's the major changes against the new version.

Changes are: 1) Remove startsWith 2) Make PHPSQLLexer use hashes instead of searching arrays 3) Store the keywords in uppercase

I know 3 might seems like a minor thing, but once the other changes are made, the test suite spends about half its time doing those uppercases!

I've checked and it's 100% passing the test suite. I haven't changed the algorithm at all - I know it looks like I've rewritten PHPSQLLexer::split() but it does exactly the same, just a slightly different way.

On my machine the test suite takes ~20.6 seconds with the original, ~0.8 seconds with this new one.

Attachment: php-sql-parser.php

witchi commented 9 years ago

From pho...@gmx.de on May 11, 2012 03:48:45

I have implemented 2) and 3) in trunk. I think, the lexer has potential to more optizations. Some functions use unset() and copy the tokens with array_values(). Maybe it is faster to transfer the tokens into a result array within the loops.

Currently I'm looking for 1).

The current version on trunk needs on my machine 2.2 seconds instead of 10 seconds, so we have a factor of about 5.

Is it possible in PHP to have static class variables? If I would initialize the functions/reserved arrays only once, it will be faster...

witchi commented 9 years ago

From rob...@gmail.com on May 11, 2012 03:57:11

Yup here's a copy with static class variables...

Attachment: php-sql-parser.php

witchi commented 9 years ago

From kor...@gmail.com on May 11, 2012 12:32:54

@Andre

Many thanks on all your effort and fantastic answer. I'll take a look more, at parser's internals, in the next days.

witchi commented 9 years ago

From greenlion@gmail.com on July 02, 2012 13:14:53

Or since it is very speed critical this should be the best.

I've changed the params to pass-by-reference which will eliminate memory allocation/copying for each invocation.

I removed the substr() which will require memory allocation and I replaced it with strncmp() which will compare the strings without allocating any new memory or doing any new copying.

Finally, I added an if() block that avoids allocating memory just to pack a string into an array, for it to immediately be unpacked.

    protected function startsWith(&$haystack, &$needles) {
        if (is_string($needles)) {
            if (strncmp($haystack, $needles,strlen($needles)) === true) return true;
        } else {
            foreach($needles as $needle) {    
                if (strncmp($haystack, $needle,strlen($needle)) === true) return $j; 
            }
        }
        return false;

    }
witchi commented 9 years ago

From greenlion@gmail.com on July 02, 2012 13:34:25

I tested that change, but it makes no effective difference :)

witchi commented 9 years ago

From pho...@gmx.de on July 03, 2012 01:26:27

:-)

I have removed all calls to startsWith() in a previous revision, the method is dead code, I remove it. Sorry...

witchi commented 9 years ago

From pho...@gmx.de on April 03, 2014 12:30:09

Status: Fixed