Closed: javadi82 closed this issue 12 years ago.
Generally, the InstanceBasedLearningExtractor is initialized once with all templates that could be relevant and used on many pages. Can you do this in your application and does it improve the performance for you?
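A minimal sketch of that pattern, assuming scrapely is installed and that `template_pages` is a list of annotated HtmlPage templates you already have; the helper names (`build_extractor`, `first_record`, `extract`) are illustrative, not part of scrapely's API, and the scrapely imports are deferred so the sketch loads even where scrapely is absent:

```python
def build_extractor(template_pages):
    # Built once, up front, with every template that could be relevant.
    from scrapely.extraction import InstanceBasedLearningExtractor
    return InstanceBasedLearningExtractor((t, None) for t in template_pages)

def first_record(records):
    # Return the first extracted record, or None when nothing matched.
    return records[0] if records else None

def extract(extractor, url, body):
    # Reuse the same extractor for every page instead of rebuilding it per call.
    from scrapely.htmlpage import HtmlPage
    page = HtmlPage(url, body=body, encoding='utf-8')
    return first_record(extractor.extract(page)[0])
```

With this split, `build_extractor` runs once at startup and only the cheap `extract` call runs per page.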
The extraction time has now been cut in half by initializing the extractor object just once, as recommended. But the current extraction times are still around 1 min per URL (running extraction on 1438 URLs takes 13 minutes). Is it possible to optimize the calls to the HtmlPage constructor?
Still, the time taken does not match what I'd expect based on the scrapely.tool, which is ~1-2s.
Following is the profiler output after the recommended change:
'''
Line #   Hits        Time   Per Hit  % Time  Line Contents
    67                                       @profile
    68                                       def extract(url, page, extractor):
    69                                           """Returns dictionary containing the extraction output
    70                                           """
    71   5741     5947200    1035.9     0.4      page = unicode(page, errors='ignore')
    72   5741  1030895392  179567.2    75.4      html_page = HtmlPage(url, body=page, encoding='utf-8')
    73
    74   5741   330400129   57551.0    24.2      records = extractor.extract(html_page)[0]
    75   5741       31588       5.5     0.0      return records[0]
'''
On my laptop, this simple example (URLs taken from your other ticket) managed to process 100 pages per second. To go beyond this, it's usually easiest to have multiple processes and split the work between them. This seems very far from the speed you are reporting.
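Splitting the work between processes, as suggested, can be sketched with the standard library; `extract_one` here is a cheap stand-in for the real per-page extraction call:

```python
from multiprocessing import Pool

def extract_one(args):
    # Stand-in for the real extraction; replace the body with the scrapely call.
    url, body = args
    return url, len(body)

def extract_all(pages, workers=4):
    # pages: iterable of (url, body) pairs, divided across worker processes.
    with Pool(processes=workers) as pool:
        return pool.map(extract_one, pages)
```

Note that each worker should build its own extractor once (for example via a Pool initializer), since extractor objects don't cheaply cross process boundaries.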
Are you sure there is no mistake, such as also counting the page download time and assuming that extraction is the bottleneck (i.e. your application is IO-bound downloading pages while you're looking at where the CPU time is spent), or perhaps measuring the time while running with the profiler enabled?
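One quick way to rule this out is to time the download and the extraction separately; `download` and `extract` in the usage sketch are hypothetical placeholders for whatever the application actually calls:

```python
import time

def timed(fn, *args):
    # Returns (result, elapsed seconds) for a single call.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Usage sketch:
#   body, dl_seconds = timed(download, url)         # IO-bound part
#   record, ex_seconds = timed(extract, url, body)  # CPU-bound part
```

If `dl_seconds` dominates, the bottleneck is the network rather than scrapely.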
If you can create a test that reproduces this behavior it would be easier to help. Some details about your platform would also be useful.
Unable to reproduce and no more details have been forthcoming.
It's currently taking me around 2s to run the extraction on a single page.
Following is the output of the line profiler:
'''
Line #, Hits, Time, Per Hit, % Time, Line Contents
'''
Am I doing something wrong? The extraction code is similar to that found in tool.py and __init__.py, but I get faster extraction times when I run scrapely from the command line than when using the code above.
Please advise.