Parsing is too slow - Githubissues

GoogleCodeExporter commented 9 years ago

Not a bug but I've found the parser is a bit too slow for my needs.
I've attached some pretty minor changes that (on my machine at least) make it 
over 10x faster... hope you find it useful - thanks for a great project!

Original issue reported on code.google.com by rob...@gmail.com on 9 May 2012 at 1:59

Attachments:

GoogleCodeExporter commented 9 years ago

Heh, nice work. I'm not an experienced PHP developer, so every optimizing is 
welcome!

Original comment by pho...@gmx.de on 9 May 2012 at 8:20

Changed state: Accepted
Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 9 years ago

I will include some of your optimizations into the codebase, but only such, 
which don't obscure the algorithms too much.

Original comment by pho...@gmx.de on 10 May 2012 at 6:30

GoogleCodeExporter commented 9 years ago

I've tried these changes and on my machine parsing the last query (most 
complex) in example.php file is ~5 times faster. Great performance. However 
with this patch, some tests fail.

Parser is very good, but didn't have much time to look at the source and check 
parsing algorithm.
I have custom built ORM, but parsing strategy is on OOP base - not sql string 
parser. Sql API is in OOP. So, I'm thinking about this parser as replacement, 
but maybe I need to port to Postgres, or maybe reorganize some pieces in order 
to integrate it. Not sure yet.

I have one question. I would much appreciate if somebody from the team can give 
me any feedback.
Do you have some internet resources about parsing algorithm (bookmarked, 
etc...) or any kind of info which can help in upgrading sql parser or for 
better understanding of its design ? 

Thanks in advance.

Original comment by kor...@gmail.com on 10 May 2012 at 5:59

GoogleCodeExporter commented 9 years ago

At the moment I use http://dev.mysql.com/doc/refman/5.1/en/ to have an idea, 
which sql statements are possible. You can implement a validating or a 
non-validating parser. The first one checks the syntax and stops on an error, 
the latter tries to recognize some keywords and uses "probabilities" to go on. 

For the first one you need a grammar, so you can decide, which is the next 
allowed keyword/character. The other parser looks forward to the next token and 
implements a type of state machine, to find out the meaning of the token.

An example:
A CREATE TABLE statement can have a KEY definition, where after the KEY word 
follows an index name and/or an index type. The parser looks for "KEY" and sets 
a state point. So it knows on the next token, that it can be a name or a type. 
But it can also be a index column. But a column follows always after a "(" 
character. So you have to look for "(" first to make this decision. In all 
other cases you have to compare the token with valid index types (BTREE, HASH) 
to make a difference between type and name. If the statement contains keywords 
i.e. column names, the PHPSQLParser has no chance to set another meaning for 
the keyword token (see issue 34). If you use BTREE as index name too, the 
parser will go wrong (the keyword would be interpreted as index type and name 
would be empty). For the non-validating parser there is a "probability", that 
BTREE is the type and not the name.

A validating parser would not split the statement into large tokens (which can 
be error-prone). It would go through the string character by character and 
decide by the grammar, which is a valid character. There are parser, which can 
use larger tokens than a single character, but this depends on the grammar 
(which is the smallest token size, which is useable to make all decision). In 
SQL you can only use size=1, because you have parts like operators (+, -, *) 
which have its own meaning.

An example:
The PHPSQLLexer splits on ', which splits also string literals. The first task 
for the lexer is to balance the ' to get a complete string literal as one token 
(if I wouldn't do it, the parser could find keywords within the splitted string 
literal). 

But the parser doesn't check, whether or not a string literal is allowed on 
this point within the sql statement. If there should be a table name, the 
PHPSQLParser will interpret the string literal token as table name. A 
validating parser would say "oh no my friend, on this position I allow only 
letters, numbers or underscores (which can be part of a table name) and not '" 
Full stop.

http://en.wikipedia.org/wiki/LL_parser
http://en.wikipedia.org/wiki/Finite_state_automata

Hope, it helps
Andre

Original comment by pho...@gmx.de on 11 May 2012 at 7:25

GoogleCodeExporter commented 9 years ago

Looks like you've changed the structure a lot since I wrote the patch.
Here's the major changes against the new version.

Changes are:
1) Remove startsWith
2) Make PHPSQLLexer use hashes instead of searching arrays
3) Store the keywords in uppercase

I know 3 might seems like a minor thing, but once the other changes are made, 
the test suite spends about half its time doing those uppercases!

I've checked and it's 100% passing the test suite.  I haven't changed the 
algorithm at all - I know it looks like I've rewritten PHPSQLLexer::split() but 
it does exactly the same, just a slightly different way.

On my machine the  test suite takes ~20.6 seconds with the original, ~0.8 
seconds with this new one.

Original comment by rob...@gmail.com on 11 May 2012 at 7:47

Attachments:

php-sql-parser.php

GoogleCodeExporter commented 9 years ago

I have implemented 2) and 3) in trunk. I think, the lexer has potential to more 
optizations. Some functions use unset() and copy the tokens with 
array_values(). Maybe it is faster to transfer the tokens into a result array 
within the loops.

Currently I'm looking for 1).

The current version on trunk needs on my machine 2.2 seconds instead of 10 
seconds, so we have a factor of about 5.

Is it possible in PHP to have static class variables? If I would initialize the 
functions/reserved arrays only once, it will be faster...

Original comment by pho...@gmx.de on 11 May 2012 at 10:48

GoogleCodeExporter commented 9 years ago

Yup here's a copy with static class variables...

Original comment by rob...@gmail.com on 11 May 2012 at 10:57

Attachments:

php-sql-parser.php

GoogleCodeExporter commented 9 years ago

@Andre

Many thanks on all your effort and fantastic answer. 
I'll take a look more, at parser's internals, in the next days.

Original comment by kor...@gmail.com on 11 May 2012 at 7:32

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

Or since it is very speed critical this should be the best.

I've changed the params to pass-by-reference which will eliminate memory 
allocation/copying for each invocation.

I removed the substr() which will require memory allocation and I replaced it 
with strncmp() which will compare the strings without allocating any new memory 
or doing any new copying.

Finally, I added an if() block that avoids allocating memory just to pack a 
string into an array, for it to immediately be unpacked.

        protected function startsWith(&$haystack, &$needles) {
            if (is_string($needles)) {
                if (strncmp($haystack, $needles,strlen($needles)) === true) return true;
            } else {
                foreach($needles as $needle) {    
                    if (strncmp($haystack, $needle,strlen($needle)) === true) return $j; 
                }
            }
            return false;

        }

Original comment by greenlion@gmail.com on 2 Jul 2012 at 8:14

GoogleCodeExporter commented 9 years ago

I tested that change, but it makes no effective difference :)

Original comment by greenlion@gmail.com on 2 Jul 2012 at 8:34

GoogleCodeExporter commented 9 years ago

:-) 

I have removed all calls to startsWith() in a previous revision, the method is 
dead code, I remove it. Sorry...

Original comment by pho...@gmx.de on 3 Jul 2012 at 8:26

GoogleCodeExporter commented 9 years ago

Original comment by pho...@gmx.de on 3 Apr 2014 at 7:30

Changed state: Fixed

mnpenner / php-sql-parser

Parsing is too slow #59