praveenbankbazaar / httparchive

Automatically exported from code.google.com/p/httparchive

make crawl parsing faster #346

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Right now the "parse" phase of the crawl can't keep up with the speed at which 
the user agents generate test results. I made it more flexible to fork 
multiple parsing processes, but that exhausted the server's CPUs. So now it's 
time to look at the parsing code and make it more efficient.

Original issue reported on code.google.com by stevesou...@gmail.com on 19 Dec 2012 at 5:24

GoogleCodeExporter commented 9 years ago
Could you provide some more information on how this is handled in the source, 
as I might be able to contribute some improvements? From a brief inspection of 
the source, it looks like "obtainXMLResult" does most of the work. If it is 
using a DOM object for this, then a streaming parser, which uses less memory, 
should be faster.
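
To illustrate the DOM-vs-stream point: a DOM parser builds the whole result document in memory before you can touch it, while a streaming parser hands you each element as it is completed. A minimal sketch of the streaming approach in Python (the project itself is PHP, and `obtainXMLResult`'s actual structure is unknown, so the element names here are hypothetical):

```python
import io
import xml.etree.ElementTree as ET

def count_requests_streaming(xml_bytes):
    """Stream-parse the XML with iterparse, clearing each element
    after use so memory stays bounded regardless of document size."""
    count = 0
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "request":  # hypothetical element name
            count += 1
        elem.clear()  # discard the subtree we just processed
    return count

xml = b"<results>" + b"<request/>" * 3 + b"</results>"
print(count_requests_streaming(xml))  # prints 3
```

The PHP equivalent of this pattern would be XMLReader (pull parsing) instead of DOMDocument.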

Original comment by charlie....@clark-consulting.eu on 4 Jan 2013 at 6:18

GoogleCodeExporter commented 9 years ago
Parsing is now 100x faster.

The issue was in divvying up the jobs across multiple parsing scripts. That's 
done using a modulo that maps each job to a parse task #. Unfortunately, the 
original code did NOT zero-base the task #, so two processes were handling the 
same jobs. This produced duplicate records, which required a REPLACE (rather 
than an INSERT), and each REPLACE locked all the other processes for ~6 seconds.

The fix was to do a plain INSERT and ignore the duplicate.
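
The partitioning scheme described above can be sketched as follows. This is an illustration, not the project's actual (PHP) code: `jobs_for_task` and its parameters are hypothetical names. The key property is that with zero-based task numbers 0..N-1, `job % N` assigns every job to exactly one task; numbering the tasks 1..N instead breaks that one-to-one mapping, which is how two processes ended up claiming the same jobs:

```python
def jobs_for_task(job_ids, task_num, num_tasks):
    """Assign each job to exactly one parse task via modulo.
    task_num must be zero-based (0 .. num_tasks-1) for the
    remainders to partition the jobs cleanly."""
    return [j for j in job_ids if j % num_tasks == task_num]

jobs = list(range(10))

# Zero-based task numbers 0..3 partition the jobs with no overlap
# and no gaps:
parts = [jobs_for_task(jobs, t, 4) for t in range(4)]
print(parts)  # prints [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

With the partition correct, each job is inserted exactly once, so a plain INSERT suffices and the lock-heavy REPLACE (used to overwrite duplicates) is no longer needed.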

Original comment by stevesou...@gmail.com on 9 Jan 2013 at 8:20