nikic / PHP-Parser

A PHP parser written in PHP

Evaluate performance and memory usage #349

Open nikic opened 7 years ago

nikic commented 7 years ago

In particular, as we now have https://github.com/Microsoft/tolerant-php-parser as a comparison point.

Interesting parts to check would be performance of parsing, node traversal and pretty printing, as well as the memory usage of the AST and whether it changes during operations (I suspect node traversal increases it due to use of references).
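
A minimal measurement harness along these lines could look roughly like the following. This is a sketch, assuming a ParserFactory/NodeTraverser/PrettyPrinter\Standard setup and a Composer autoloader; the NameResolver visitor merely stands in for "some traversal work":

<?php
require 'vendor/autoload.php'; // assumes a Composer install

use PhpParser\ParserFactory;
use PhpParser\NodeTraverser;
use PhpParser\NodeVisitor\NameResolver;
use PhpParser\PrettyPrinter\Standard;

$code = file_get_contents($argv[1]);

$parser = (new ParserFactory)->create(ParserFactory::PREFER_PHP7);
$traverser = new NodeTraverser();
$traverser->addVisitor(new NameResolver()); // stand-in for real traversal work
$printer = new Standard();

$t = microtime(true);
$ast = $parser->parse($code);
printf("parse:    %.3fs\n", microtime(true) - $t);

$t = microtime(true);
$ast = $traverser->traverse($ast);
printf("traverse: %.3fs\n", microtime(true) - $t);

$t = microtime(true);
$printer->prettyPrintFile($ast);
printf("print:    %.3fs\n", microtime(true) - $t);

printf("memory:   %.1f MiB\n", memory_get_usage() / (1024 * 1024));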

Also relevant is the discussion in #312 on the attribute system. I suspect that is currently the heaviest part in terms of memory usage. It might be time to make location information a first-class citizen, as it is increasingly used.

nikic commented 7 years ago

Landed two micro-optimizations, https://github.com/nikic/PHP-Parser/commit/3bbf8d8f7a21397030f9729041df3cf5cb74d217 and https://github.com/nikic/PHP-Parser/commit/8f623fb2413f95cc3cefe0747c22f8abe89692cf, which together improve parser performance by about 12%.

From a quick run through massif, the memory overhead of the attribute system is huge. Attributes are using more than 3x more memory than the AST objects themselves. This is not particularly surprising, as non-empty arrays in PHP have a minimum size of 344 bytes, while objects start at 40. (Correction: I interpreted some profiles incorrectly; the actual overhead of the attributes is closer to 80%, not 3x.)
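
The per-element gap can be eyeballed with a throwaway script along these lines (a rough sketch; the containing arrays add their own per-element overhead, so the printed numbers are only indicative):

<?php
// Compare the approximate size of small objects vs. small associative arrays.
class Probe {
    public $startLine;
    public $endLine;
}

$n = 100000;

$before = memory_get_usage();
$objects = [];
for ($i = 0; $i < $n; $i++) {
    $objects[] = new Probe();
}
printf("object with 2 props: ~%d bytes\n", (memory_get_usage() - $before) / $n);

$before = memory_get_usage();
$arrays = [];
for ($i = 0; $i < $n; $i++) {
    $arrays[] = ['startLine' => $i, 'endLine' => $i];
}
printf("array with 2 keys:   ~%d bytes\n", (memory_get_usage() - $before) / $n);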

It would be great to improve on that... One way would be to make locations (the main use of attributes) a first-class Location object. Even in an unoptimized representation with six properties (three pairs for line, offset and token offset) this would weigh in at 136 bytes, 2.5x less than the attributes array. A compact representation with three properties would weigh in at 88 bytes, which is 4x less than the attributes array.
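
A sketch of what the unoptimized six-property variant could look like (this is not an existing PHP-Parser class; the property names just mirror the current location attributes):

<?php
namespace PhpParser;

// Hypothetical first-class location object: start/end pairs for line,
// file offset and token offset. Six integer properties put an instance
// at roughly 136 bytes; a compact three-property variant would be ~88.
class Location
{
    /** @var int */
    public $startLine, $endLine;
    /** @var int */
    public $startFilePos, $endFilePos;
    /** @var int */
    public $startTokenPos, $endTokenPos;
}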

The big issue with this (apart from the change itself, which will at least require changes to the node constructor signatures) is that we get the full cost of the attributes array back as soon as even a single other attribute is set. In that case we'd end up in a worse situation than we started with, because we'd have to carry both the cost of the Location object and of the attributes array (which starts at 344 bytes).

It's pretty common that one or two attributes are added to each node. For example, the CloningVisitor in this project adds an origNode attribute to each node. I've seen other projects add a parentNode attribute to each node. Any of those would reintroduce the attributes array (or, if they are added as dynamic properties, the dynamic properties table, which has the same cost).
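
For illustration, a parent-tracking visitor of that kind could be as simple as the following sketch (the class name is made up; this is exactly the pattern that brings the per-node attributes array back):

<?php
use PhpParser\Node;
use PhpParser\NodeVisitorAbstract;

class ParentAttributeVisitor extends NodeVisitorAbstract
{
    /** @var Node[] */
    private $stack = [];

    public function beforeTraverse(array $nodes) {
        $this->stack = [];
    }

    public function enterNode(Node $node) {
        if (!empty($this->stack)) {
            // Setting any attribute materializes the node's attributes array.
            $node->setAttribute('parentNode', end($this->stack));
        }
        $this->stack[] = $node;
    }

    public function leaveNode(Node $node) {
        array_pop($this->stack);
    }
}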

I'm not sure how this can be solved... One very crazy idea I had was to add a use PhpParser\NodeExt; line in the NodeAbstract class, for which we provide an empty default implementation, but which other projects can override in their bootstrap process to be something like this:

namespace PhpParser {
    // Skip the definition if a consuming project already provided its own NodeExt.
    if (!trait_exists('PhpParser\NodeExt')) {
        trait NodeExt {
            /** @var Node */
            public $parentNode;
        }
    }
}

This would allow a consuming project to add additional properties to the node structure, avoiding the memory increase and maybe also providing better autocomplete support. If multiple libraries define it, then only one will be chosen -- which is per se not problematic, as it does not change behavior; it only makes memory usage worse.
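
For completeness, the PHP-Parser side of that idea would look roughly like this (a sketch only; the real NodeAbstract obviously contains much more than shown):

<?php
namespace PhpParser;

abstract class NodeAbstract implements Node
{
    // Empty by default; a consuming project may define PhpParser\NodeExt
    // in its bootstrap (before this file is loaded) to add extra properties.
    use NodeExt;

    // ... existing NodeAbstract implementation ...
}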

Gert-dev commented 7 years ago

Thanks for looking into improving performance and memory usage; I do a lot of parsing and traversing/visiting using php-parser and will certainly benefit from it.

I notice that especially large files become problematic. I discovered a file here that takes around 10 seconds to parse; it is part of the TCPDF fonts, which encode font data in PHP files. Not the greatest idea, I know, but these files really pressure the parser's performance as well as spike memory usage. They may be a bit artificial, but I wanted to pass them along anyway as a sample for measuring possible memory and performance improvements.

As for storing the attributes in associative arrays: I'm not deeply familiar with PHP internals, but I have heard of the php-ds project, which offers a PHP library with a Map data type that can automatically switch to a PHP extension implementation if it is enabled. This way, users enabling that extension could automatically squeeze out extra performance, while those without it would transparently fall back to the plain PHP implementation. I don't know if this will help memory usage or performance, but as the extension is written in plain C, it could be worth investigating. Also, its docs are here, in case it's hard to find.
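
For what it's worth, attribute storage on top of that library would look something like this (assuming php-ds or its polyfill is installed; whether it actually saves anything is exactly the open question):

<?php
use Ds\Map;

$attributes = new Map();
$attributes->put('startLine', 42);
$attributes->put('endLine', 45);

var_dump($attributes->get('startLine'));    // int(42)
var_dump($attributes->get('comments', [])); // second argument is the default for missing keys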

nikic commented 7 years ago

This is one of those cases where we run afoul of PHP's inadequate GC algorithm -- the majority of time is not spent parsing, but rather scanning the object graph again and again, without ever freeing anything. For me, parsing the provided file takes 12.7 seconds with the GC enabled and 1.4 seconds if it is disabled.

Because the majority of the time here is spent in the GC, this is a case that will not benefit from optimizations in the parser. It may be advisable to disable the GC during parsing, but this is a somewhat dangerous proposition, especially as the parser is commonly used in longer-running code. While the AST is acyclic by design and no cycle leaks may originate from there, the construction code is not necessarily cycle-free (e.g. the parser holds onto cycles via closures, so destroying it before the GC is re-enabled would cause a leak).
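
For anyone wanting to experiment with this, the pattern would be something along these lines (a sketch; note the caveat above about cycles created while the collector is off):

<?php
$parser = (new PhpParser\ParserFactory)->create(PhpParser\ParserFactory::PREFER_PHP7);

gc_disable();
try {
    $ast = $parser->parse(file_get_contents($file));
} finally {
    // Re-enable while the parser (and the closures it holds) still exists.
    gc_enable();
}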

Unless this issue is fixed upstream, I can only suggest to not parse files that exceed a certain file size -- as in this case, such files tend to be data anyway.

Regarding php-ds: It's not really possible to use this library unless you add a hard requirement on the extension. While a PHP implementation exists, it is extremely slow. For example, map lookups are O(n) in the PHP implementation, while they should be O(1). There has recently been some discussion about dropping the polyfill altogether, for this reason.

theofidry commented 6 years ago

@nikic it may be worth mentioning in the docs that GC and/or Xdebug really make it tremendously slower, often a 10-20x time difference.

nikic commented 6 years ago

@theofidry I've added a performance section to the docs: https://github.com/nikic/PHP-Parser/blob/master/doc/component/Performance.markdown

theofidry commented 6 years ago

👌 Many thanks

Gert-dev commented 6 years ago

@nikic Your suggestion regarding the GC was a very good one. I found that disabling it in my (long-running) process resulted in a rather noticeable performance improvement and reduction in latency. I've switched to manually collecting circular references every so often using gc_collect_cycles instead.
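
In code, that approach boils down to something like the following sketch ($files, the indexing step and the batch size of 100 are just placeholders):

<?php
gc_disable();

foreach ($files as $i => $file) {
    $ast = $parser->parse(file_get_contents($file));
    // ... traverse / index the AST ...

    if ($i % 100 === 0) {
        gc_collect_cycles(); // collect circular references every so often
    }
}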

I also wanted to drop a potentially rather extreme idea I had when investigating potential performance improvements in my own code, which I realized might also be relevant to php-parser: would it be possible to parallelize parsing and/or traversing?

I'm not sure to what extent parallelization is possible for parsing, but in the case of visiting it would (?) be possible to spawn up to n workers simultaneously for the root nodes of an array, which would then spawn more workers (up to a maximum) as they descend the tree and invoke the visitors. I'm thinking of something like D's parallel or OpenMP's parallel for.

This should probably be put behind a switch, since the caller's visitors may or may not be ready for asynchronous execution; they may be holding state based on the previously hit node, for example.

I'm also not sure how good threading support is in PHP (I've heard the not-so-positive, but possibly outdated, stories). In any case, since an extension such as pthreads -- which seems to be one of the few extensions that works on all major platforms -- can't really be a hard requirement for php-parser, something like a polyfill that reverts to the old behavior, or simply a conditional switch, may be desired.

Just to be clear: this isn't a feature request and it may not even be actually possible in php-parser, but I wanted to throw the idea out there.

nikic commented 6 years ago

@Gert-dev Just to make sure there is no misunderstanding here: even if you manually invoke the cycle collector using gc_collect_cycles(), this will only perform GC on the objects in the root buffer. Once the root buffer is full, additional objects will no longer be added. As such, manual calls to gc_collect_cycles() are going to leak cycles which are not rooted within those first 10k objects. Especially for a long-running process this might become a concern.

Regarding the parallelization, I don't believe parallelizing the parsing of a single file is possible (though of course processing multiple files in parallel is possible). Parallelizing the traversal should be possible (as you describe) if the visitor is order-independent, though I suspect that the overhead of the parallelization will be much higher in this case than the benefit.
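
A process-based sketch of the "multiple files in parallel" variant could look like this (assumes the pcntl extension on a POSIX system; getting the ASTs back to the parent would need serialization or shared storage, which may eat much of the benefit):

<?php
use PhpParser\ParserFactory;

$workers = 4;
$chunks = array_chunk($files, max(1, (int) ceil(count($files) / $workers)));

$pids = [];
foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid === 0) {
        // Child process: parse its own chunk of files.
        $parser = (new ParserFactory)->create(ParserFactory::PREFER_PHP7);
        foreach ($chunk as $file) {
            $parser->parse(file_get_contents($file));
        }
        exit(0);
    }
    $pids[] = $pid;
}

foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}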

Gert-dev commented 6 years ago

@nikic You're absolutely right, and I'm aware. I should investigate this in detail at some point. Granted, there are a lot of objects going around, but I have not yet figured out if and why I specifically hit this limit (perhaps the problem is even rather trivial to avoid once tracked down).

I realize disabling the circular reference collector is somewhat of a band-aid in this regard, but as the situation currently stands (i.e. I have not found the root cause yet), the references are already leaking, so if I can improve performance without worsening an already poor situation, so be it 😄.

Unfortunately for me, since my project handles code bases of theoretically infinite sizes, there will likely be a point where this becomes close to unavoidable or where extreme measures will be required to avoid the buffer hitting its maximum.

Apart from that, there is probably a reason for having chosen the limit of 10,000 - my guess is that a buffer is allocated up front and they needed a sane default for its size, which is okay - but I don't really understand why this is currently handled rather poorly in extreme situations.

In conclusion, the PHP runtime basically just throws its hands up in the air and says "Damn, son, I can't handle this many objects, I'm bailing". All of this happens completely without any sort of warning to the developer. In languages such as C there is no one to tell you you are leaking memory (it's your own fault, after all; you should have freed it if you allocated it), but in this case PHP could at least warn that there is no more room in the buffer and a leak may ensue.

... but you don't need to answer any of this, I just wanted to vent a bit of frustration 😉.

Regarding the parallelization, you may be right. I guess it all depends on how many objects there are to traverse, as well as what the user is doing in the visitors themselves. If nothing more than collecting some data is happening, even for larger hierarchies, I suspect parallelizing them will have little effect.

nikic commented 6 years ago

Just to be clear, as long as you don't disable GC, hitting the root buffer limit isn't going to cause a leak --- it "only" kills performance if it's happening too quickly.

We have recently been working on a solution to the performance problem and the final iteration of the implementation (https://github.com/php/php-src/pull/3165) has landed today. As such, the GC issue should be mostly fixed in PHP 7.3.
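
As of PHP 7.3 the collector can also be inspected directly, which makes it easier to tell whether root-buffer pressure is the bottleneck (the output values below are purely illustrative):

<?php
var_dump(gc_status());
// e.g. array(4) { ["runs"] => 12, ["collected"] => 450000, ["threshold"] => 10001, ["roots"] => 231 }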

Gert-dev commented 6 years ago

@nikic That's great to hear! I must admit that I hoped my rant would somehow propagate up the chain to the core, as I knew you were involved there. My apologies for this sneaky tactic :wink:.

Purely out of interest, and at the risk of going off-topic (I'm also not familiar with PHP core): if I don't disable GC and I hit the limit, how are new roots being tracked if none of the existing ones can be dropped? I assumed that some sort of static buffer with a maximum length was being used, so any new entries would not fit into this buffer and would be dropped, in turn resulting in those circular references never being collected and freed - in other words, a memory leak.

EDIT: On a more on-topic note: I'm not sure if this is feasible at all, but some sort of incremental parsing system might also provide a performance boon in scenarios where only small changes are made to the source code, such as in language servers. These currently require some sort of intelligent mechanism on the caller's side to determine which parts of the code to parse, or just reparsing the entire source code. Apparently Atom's upcoming tree-sitter does something similar. On a side note, it apparently can also recover from errors and seems to be fairly generic; there was a talk about it at FOSDEM that's freely viewable here.

muglug commented 6 years ago

@Gert-dev @nikic I'm working on incremental parsing for Psalm. Current activity is in https://github.com/vimeo/psalm/tree/server-temp-changes but it'll likely get merged into master soon.

Dealing with a change in a method in a large file (I'm using one from PHPMailer) gives the following results:

Partial parsing: diffing lines:     0.0073
Partial parsing: cloning:           0.0169
Partial parsing: parsing traverser: 0.0250
------------------------------------------
Partial parsing:                    0.0499
Full parsing:                       0.0842

Parsing an even larger file, a closed-source 10K-line behemoth, gives these results:

Partial parsing: diffing lines:     0.0182
Partial parsing: cloning:           0.0481
Partial parsing: parsing traverser: 0.0762
------------------------------------------
Partial parsing:                    0.1438 
Full parsing:                       0.2750

mattacosta commented 5 years ago

Results from parsing every PHP file in some large frameworks:

Drupal (8.6.x @ 5d17da1) - 8401 files

parser                           time (ms)    memory (kb)
microsoft/tolerant-php-parser    16455.64     1174311.48
nikic/php-parser                 25693.36     1784685.80

Magento2 (2.3-develop @ b252ca0) - 21688 files

parser                           time (ms)    memory (kb)
microsoft/tolerant-php-parser    32648.24     2384537.00
nikic/php-parser                 43200.28     3423339.44

Gert-dev commented 5 years ago

@muglug Looks interesting! The branch seems to have disappeared in the meantime; is it this commit you are referring to?

It would be neat if this could be a decorator just implementing the Parser interface that works transparently, only parsing the necessary parts of the file when it detects that it has changed (and otherwise just doing the normal parsing).

The results of the operation could then be stored somewhere in a predesignated folder (such as /tmp) or in memory. In my case I'd probably need to limit that cache to a fixed size, as code bases can be large, but that shouldn't be a problem, since the most recently used files - the ones the user is working in - will stay hot in the cache anyway.
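
A bare-bones version of such a decorator might look like this (a sketch only: it merely caches whole-file parses keyed on content rather than doing an actual partial reparse, and the class name is made up):

<?php
use PhpParser\ErrorHandler;
use PhpParser\Parser;

class CachingParser implements Parser
{
    /** @var Parser */
    private $inner;
    /** @var array<string, \PhpParser\Node\Stmt[]|null> */
    private $cache = [];

    public function __construct(Parser $inner)
    {
        $this->inner = $inner;
    }

    public function parse(string $code, ErrorHandler $errorHandler = null)
    {
        $key = md5($code);
        if (!array_key_exists($key, $this->cache)) {
            // Delegate to the wrapped parser only when the content is unseen.
            $this->cache[$key] = $this->inner->parse($code, $errorHandler);
        }
        return $this->cache[$key];
    }
}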

@mattacosta Interesting benchmarks. One should, however, add some nuance to them, as it appears the tolerant PHP parser is not fully compliant yet - at least according to the checkboxes in its README. So while they may give some insight, the comparison is currently a bit like comparing apples to oranges: in order to become as compliant as PHP-Parser, its performance may degrade.

muglug commented 5 years ago

@Gert-dev here's the (more mature) file that's used today: https://github.com/vimeo/psalm/blob/03ea94e087a10a69def73a1b6e38121bfe0f301f/src/Psalm/Visitor/PartialParserVisitor.php

mattacosta commented 5 years ago

@Gert-dev There haven't been any significant changes to that README in almost two years. For instance, it still references a custom lexer, but that was abandoned prior to release and it currently uses token_get_all instead. I do not expect a PHP-based implementation to be faster than that either.

As for compliance, you are correct in that it still has a lot of room for improvement, especially with regard to expressions. Even if corrected however (and that is a big if), I do not expect the time-complexity of the methods used to parse those rules to change significantly, so any performance changes should be minimal.

For those interested in pure speed, I should also mention that I tested some fully-representative, JS-based implementations and they were both roughly twice as fast as (or faster than) their PHP-based counterparts. While that really is apples to oranges, I still think it'd be interesting to see what could be done to close the gap.

vudaltsov commented 8 months ago

https://github.com/nikic/PHP-Parser/issues/980 seems to be relevant to this issue.