vezaynk / Sitemap-Generator-Crawler

PHP script to recursively crawl websites and generate a sitemap. Zero dependencies.
https://www.bbss.dev
MIT License

Performance Improvements #26

Open vezaynk opened 7 years ago

vezaynk commented 7 years ago

Performance is below what is desired, but improving it beyond a certain point is hard.

This issue should never be closed and will track whatever steps are taken towards a more optimised solution along with benchmarks.

vezaynk commented 7 years ago

I have marked this issue as "help wanted" since I can always use a hand, as well as "hacktoberfest" to take advantage of free labour.

vezaynk commented 6 years ago

The last two patches (#67 and #68) resulted in an over 30% speed improvement on my laptop. It is less noticeable on more powerful machines.

eugenzaharia commented 4 years ago

You could try the curl_multi_* functions instead of plain curl, since they run requests in parallel and would save some response time. Also, consider building the XML in a SimpleXMLElement or DOMDocument object and writing it to the file once, instead of doing many small I/O operations.
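A minimal sketch of what that suggestion could look like, assuming hypothetical helper names (`fetch_all`, `build_sitemap`) that are not part of the project: pages are fetched in parallel through the curl_multi_* API, and the sitemap is buffered in a DOMDocument so it can be written with a single I/O call.

```php
<?php
// Illustrative sketch only; function names are hypothetical, not the script's API.

// Fetch every URL in parallel with the curl_multi_* functions and
// return an array mapping each URL to its response body.
function fetch_all(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until every handle is finished, waiting for
    // socket activity instead of busy-looping.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh);
        }
    } while ($active && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $url => $ch) {
        $bodies[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $bodies;
}

// Build the whole sitemap in memory with DOMDocument, returning one
// string that can be written to disk in a single operation.
function build_sitemap(array $urls): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;
    $urlset = $doc->createElement('urlset');
    $urlset->setAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    $doc->appendChild($urlset);
    foreach ($urls as $url) {
        $node = $doc->createElement('url');
        // createElement() does not escape its value, so escape explicitly.
        $node->appendChild($doc->createElement('loc', htmlspecialchars($url)));
        $urlset->appendChild($node);
    }
    return $doc->saveXML();
}
```

The trade-off, as discussed below, is that buffering the whole document trades memory for fewer writes.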

vezaynk commented 4 years ago

@eugenzaharia curl_multi is very awkward to work with. I mean PHP is awkward in general, but that's a separate story.

I'd be cautious about using SimpleXMLElement or anything similar, as it could introduce dependencies and break things for current users. The primary design goal of this script is for it to run reliably in almost any PHP environment, including the weird ones.

Streaming to the file system instead of buffering the whole file in memory is the result of a constraint: some websites turned out to have so many links that the script would hit low-memory conditions. Streaming simply makes it run out of memory less often. It's not really better or worse, design-wise.
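The streaming approach described above can be sketched as follows, with hypothetical helper names (`open_sitemap`, `write_url`, `close_sitemap`) that are not the script's actual API: the file is opened once, each entry is flushed as it is discovered, and memory use stays bounded no matter how many links the crawl finds.

```php
<?php
// Illustrative sketch of streaming a sitemap to disk; names are hypothetical.

// Open the output file and write the fixed XML header.
function open_sitemap(string $path)
{
    $fh = fopen($path, 'w');
    fwrite($fh, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
    fwrite($fh, "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
    return $fh;
}

// Append one <url> entry immediately; nothing accumulates in RAM.
function write_url($fh, string $url): void
{
    fwrite($fh, "  <url><loc>" . htmlspecialchars($url) . "</loc></url>\n");
}

// Write the closing tag and release the handle.
function close_sitemap($fh): void
{
    fwrite($fh, "</urlset>\n");
    fclose($fh);
}
```

Usage would be one `write_url()` call per discovered link during the crawl, at the cost of more (small) write operations than a single buffered dump.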

I'd rather rewrite the entire thing as a dynamic PHP extension if we're rewriting things.