**Closed** — stooit closed this issue 4 years ago.
I have started looking into this. Performance definitely degrades in these circumstances: as content is parsed and links are extracted and marked off, the time spent in any function that checks the URL collection starts to blow out and dominates the total time to process a node. Profiling shows many thousands of calls to contains() after only a few dozen URLs.
OK, I believe the root cause is that SpatieCrawler uses a Collection class for its queue system. That class locates elements by looping rather than by key-based hashing, so as our URL list grew, contains() ended up iterating the list millions of times.
The good news is that a "simple" array can do the same job.
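The difference is purely algorithmic: a linear scan costs O(n) per membership check, while a hash-keyed lookup costs O(1) on average. A minimal sketch of the idea in Python (the names `seen_list`/`seen_set` are illustrative, not from the Spatie code; in PHP the equivalent is using URLs as array keys with `isset()` instead of calling `contains()` on a Collection):

```python
# Slow pattern: membership in a plain list is a linear scan,
# O(n) per check — this is what the Collection's contains() does.
seen_list = []

def contains_list(url: str) -> bool:
    return url in seen_list  # walks the list element by element

# Fast pattern: a set hashes the key, so membership is O(1) on average —
# the same effect as keying a PHP array by URL and calling isset().
seen_set = set()

def mark_seen(url: str) -> None:
    seen_set.add(url)

def contains_set(url: str) -> bool:
    return url in seen_set

mark_seen("https://example.com/page1")
```

With a few dozen URLs the difference is invisible, but at tens of thousands of URLs and thousands of checks per node, the O(n) scan dominates processing time.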
This is what that change looks like in time (seconds); the key columns are content_into_link_adder and mark_as_processed (lower is better):
| class | URLs in 20 mins | json_decode_cache (s) | content_into_link_adder (s) | mark_as_processed() (s) | fake_response_crawled() (s) |
| --- | --- | --- | --- | --- | --- |
| Collection | 265 | 0.00172933 | 4.48878198 | 0.02165289 | 0.00034714 |
| Array | 5241 | 0.00357562 | 0.07701911 | 1.1306e-05 | 0.0004495 |
From 265 nodes in 20 mins to 5241 😄
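The shape of that speedup is easy to reproduce with a micro-benchmark. This is a hedged, illustrative sketch (absolute numbers will vary by machine; it only demonstrates the list-vs-hash gap, not the crawler itself):

```python
import timeit

# Build a URL collection roughly the size the crawler reached.
n = 5000
urls = [f"https://example.com/page/{i}" for i in range(n)]
url_list = list(urls)
url_set = set(urls)

# Probe with the last element — the worst case for a linear scan.
probe = urls[-1]

# Time 1000 membership checks against each structure.
list_time = timeit.timeit(lambda: probe in url_list, number=1000)
set_time = timeit.timeit(lambda: probe in url_set, number=1000)
```

At this size the hash lookup is typically orders of magnitude faster, which matches the content_into_link_adder drop from ~4.49 s to ~0.077 s above.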
Old (average content_into_link_adder: 4.49 s):
New (average content_into_link_adder: 0.077 s):
I will make similar changes to the Fetcher (which suffers from the same issue) and push them up within the next day.
**Describe the bug**
The crawler (spider) gets progressively slower on very large sites (e.g. when crawling tens of thousands of URLs).

**Sample configuration**
TBD

**Expected behavior**
Crawler performance should not degrade over time. Related to #28, but likely due to ongoing hash checks for duplicate content.
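If duplicate-content hash checks are the culprit, the same fix applies: store the digests in a hash-keyed structure so each check is O(1) regardless of how many pages have been crawled. A hedged sketch of that pattern (the use of SHA-256 and the `is_duplicate` helper are illustrative assumptions, not the crawler's actual dedup code):

```python
import hashlib

# Digests of every page body seen so far; set membership is O(1) on
# average, so the check does not slow down as the crawl grows.
seen_hashes: set[str] = set()

def is_duplicate(content: bytes) -> bool:
    """Return True if this exact content has been seen before."""
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```

Scanning a list of stored hashes instead would reintroduce exactly the O(n)-per-check slowdown described above.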