Closed GoogleCodeExporter closed 8 years ago
Heh, nice work. I'm not an experienced PHP developer, so every optimizing is
welcome!
Original comment by pho...@gmx.de
on 9 May 2012 at 8:20
I will include some of your optimizations into the codebase, but only such,
which don't obscure the algorithms too much.
Original comment by pho...@gmx.de
on 10 May 2012 at 6:30
I've tried these changes and on my machine parsing the last query (most
complex) in example.php file is ~5 times faster. Great performance. However
with this patch, some tests fail.
Parser is very good, but didn't have much time to look at the source and check
parsing algorithm.
I have custom built ORM, but parsing strategy is on OOP base - not sql string
parser. Sql API is in OOP. So, I'm thinking about this parser as replacement,
but maybe I need to port to Postgres, or maybe reorganize some pieces in order
to integrate it. Not sure yet.
I have one question. I would much appreciate if somebody from the team can give
me any feedback.
Do you have some internet resources about parsing algorithm (bookmarked,
etc...) or any kind of info which can help in upgrading sql parser or for
better understanding of its design ?
Thanks in advance.
Original comment by kor...@gmail.com
on 10 May 2012 at 5:59
At the moment I use http://dev.mysql.com/doc/refman/5.1/en/ to have an idea,
which sql statements are possible. You can implement a validating or a
non-validating parser. The first one checks the syntax and stops on an error,
the latter tries to recognize some keywords and uses "probabilities" to go on.
For the first one you need a grammar, so you can decide, which is the next
allowed keyword/character. The other parser looks forward to the next token and
implements a type of state machine, to find out the meaning of the token.
An example:
A CREATE TABLE statement can have a KEY definition, where after the KEY word
follows an index name and/or an index type. The parser looks for "KEY" and sets
a state point. So it knows on the next token, that it can be a name or a type.
But it can also be a index column. But a column follows always after a "("
character. So you have to look for "(" first to make this decision. In all
other cases you have to compare the token with valid index types (BTREE, HASH)
to make a difference between type and name. If the statement contains keywords
i.e. column names, the PHPSQLParser has no chance to set another meaning for
the keyword token (see issue 34). If you use BTREE as index name too, the
parser will go wrong (the keyword would be interpreted as index type and name
would be empty). For the non-validating parser there is a "probability", that
BTREE is the type and not the name.
A validating parser would not split the statement into large tokens (which can
be error-prone). It would go through the string character by character and
decide by the grammar, which is a valid character. There are parser, which can
use larger tokens than a single character, but this depends on the grammar
(which is the smallest token size, which is useable to make all decision). In
SQL you can only use size=1, because you have parts like operators (+, -, *)
which have its own meaning.
An example:
The PHPSQLLexer splits on ', which splits also string literals. The first task
for the lexer is to balance the ' to get a complete string literal as one token
(if I wouldn't do it, the parser could find keywords within the splitted string
literal).
But the parser doesn't check, whether or not a string literal is allowed on
this point within the sql statement. If there should be a table name, the
PHPSQLParser will interpret the string literal token as table name. A
validating parser would say "oh no my friend, on this position I allow only
letters, numbers or underscores (which can be part of a table name) and not '"
Full stop.
http://en.wikipedia.org/wiki/LL_parser
http://en.wikipedia.org/wiki/Finite_state_automata
Hope, it helps
Andre
Original comment by pho...@gmx.de
on 11 May 2012 at 7:25
Looks like you've changed the structure a lot since I wrote the patch.
Here's the major changes against the new version.
Changes are:
1) Remove startsWith
2) Make PHPSQLLexer use hashes instead of searching arrays
3) Store the keywords in uppercase
I know 3 might seems like a minor thing, but once the other changes are made,
the test suite spends about half its time doing those uppercases!
I've checked and it's 100% passing the test suite. I haven't changed the
algorithm at all - I know it looks like I've rewritten PHPSQLLexer::split() but
it does exactly the same, just a slightly different way.
On my machine the test suite takes ~20.6 seconds with the original, ~0.8
seconds with this new one.
Original comment by rob...@gmail.com
on 11 May 2012 at 7:47
Attachments:
I have implemented 2) and 3) in trunk. I think, the lexer has potential to more
optizations. Some functions use unset() and copy the tokens with
array_values(). Maybe it is faster to transfer the tokens into a result array
within the loops.
Currently I'm looking for 1).
The current version on trunk needs on my machine 2.2 seconds instead of 10
seconds, so we have a factor of about 5.
Is it possible in PHP to have static class variables? If I would initialize the
functions/reserved arrays only once, it will be faster...
Original comment by pho...@gmx.de
on 11 May 2012 at 10:48
Yup here's a copy with static class variables...
Original comment by rob...@gmail.com
on 11 May 2012 at 10:57
Attachments:
@Andre
Many thanks on all your effort and fantastic answer.
I'll take a look more, at parser's internals, in the next days.
Original comment by kor...@gmail.com
on 11 May 2012 at 7:32
[deleted comment]
[deleted comment]
[deleted comment]
Or since it is very speed critical this should be the best.
I've changed the params to pass-by-reference which will eliminate memory
allocation/copying for each invocation.
I removed the substr() which will require memory allocation and I replaced it
with strncmp() which will compare the strings without allocating any new memory
or doing any new copying.
Finally, I added an if() block that avoids allocating memory just to pack a
string into an array, for it to immediately be unpacked.
protected function startsWith(&$haystack, &$needles) {
if (is_string($needles)) {
if (strncmp($haystack, $needles,strlen($needles)) === true) return true;
} else {
foreach($needles as $needle) {
if (strncmp($haystack, $needle,strlen($needle)) === true) return $j;
}
}
return false;
}
Original comment by greenlion@gmail.com
on 2 Jul 2012 at 8:14
I tested that change, but it makes no effective difference :)
Original comment by greenlion@gmail.com
on 2 Jul 2012 at 8:34
:-)
I have removed all calls to startsWith() in a previous revision, the method is
dead code, I remove it. Sorry...
Original comment by pho...@gmx.de
on 3 Jul 2012 at 8:26
Original comment by pho...@gmx.de
on 3 Apr 2014 at 7:30
Original issue reported on code.google.com by
rob...@gmail.com
on 9 May 2012 at 1:59Attachments: