**Closed** — stooit closed this issue 4 years ago.
I have started looking into this. Performance definitely degrades in these circumstances: as content is parsed and links are extracted and marked off, the time spent in any function that checks the URL collection starts to blow out and dominates the total time to process a node. Profiling shows many thousands of calls to contains() after only a few dozen URLs.
OK, I believe the root cause is that SpatieCrawler uses a Collection class for its queue system. That class locates elements by looping rather than by key-based hashing, so as our URL list grew, contains() ended up iterating the list millions of times.
The good news is that a "simple" array can do the same job.
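The difference is purely algorithmic: a linear scan costs O(n) per membership check, while a hash-keyed lookup costs O(1) on average. A minimal sketch of the idea in Python (the names `seen_list`/`seen_set` are illustrative, not from the Spatie code; in PHP the equivalent is using URLs as array keys with `isset()` instead of calling `contains()` on a Collection):

```python
# Slow pattern: membership in a plain list is a linear scan,
# O(n) per check — this is what the Collection's contains() does.
seen_list = []

def contains_list(url: str) -> bool:
    return url in seen_list  # walks the list element by element

# Fast pattern: a set hashes the key, so membership is O(1) on average —
# the same effect as keying a PHP array by URL and calling isset().
seen_set = set()

def mark_seen(url: str) -> None:
    seen_set.add(url)

def contains_set(url: str) -> bool:
    return url in seen_set

mark_seen("https://example.com/page1")
```

With a few dozen URLs the difference is invisible, but at tens of thousands of URLs and thousands of checks per node, the O(n) scan dominates processing time.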
This is what that change looks like in time (seconds); the key columns are content_into_link_adder and mark_as_processed (lower is better):
| class | URLs in 20 mins | json_decode_cache (s) | content_into_link_adder (s) | mark_as_processed() (s) | fake_response_crawled() (s) |
| --- | --- | --- | --- | --- | --- |
| Collection | 265 | 0.00172933 | 4.48878198 | 0.02165289 | 0.00034714 |
| Array | 5241 | 0.00357562 | 0.07701911 | 1.1306e-05 | 0.0004495 |
From 265 nodes in 20 mins to 5241 😄
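The shape of that speedup is easy to reproduce with a micro-benchmark. This is a hedged, illustrative sketch (absolute numbers will vary by machine; it only demonstrates the list-vs-hash gap, not the crawler itself):

```python
import timeit

# Build a URL collection roughly the size the crawler reached.
n = 5000
urls = [f"https://example.com/page/{i}" for i in range(n)]
url_list = list(urls)
url_set = set(urls)

# Probe with the last element — the worst case for a linear scan.
probe = urls[-1]

# Time 1000 membership checks against each structure.
list_time = timeit.timeit(lambda: probe in url_list, number=1000)
set_time = timeit.timeit(lambda: probe in url_set, number=1000)
```

At this size the hash lookup is typically orders of magnitude faster, which matches the content_into_link_adder drop from ~4.49 s to ~0.077 s above.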
Old (average content_into_link_adder: 4.49 s):
New (average content_into_link_adder: 0.077 s):
I will make similar changes to the Fetcher (which suffers from the same issue) and push them up within the next day.
**Describe the bug**
The crawler (spider) gets progressively slower on very large sites (e.g. when crawling tens of thousands of URLs).

**Sample configuration**
TBD

**Expected behavior**
Crawler performance should not degrade over time. Related to #28, but likely due to ongoing hash checks for duplicate content.
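If duplicate-content hash checks are the culprit, the same fix applies: store the digests in a hash-keyed structure so each check is O(1) regardless of how many pages have been crawled. A hedged sketch of that pattern (the use of SHA-256 and the `is_duplicate` helper are illustrative assumptions, not the crawler's actual dedup code):

```python
import hashlib

# Digests of every page body seen so far; set membership is O(1) on
# average, so the check does not slow down as the crawl grows.
seen_hashes: set[str] = set()

def is_duplicate(content: bytes) -> bool:
    """Return True if this exact content has been seen before."""
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```

Scanning a list of stored hashes instead would reintroduce exactly the O(n)-per-check slowdown described above.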