scrapy / scrapely

A pure-python HTML screen-scraping library

Slow Extraction Times #11

Closed: javadi82 closed this issue 12 years ago

javadi82 commented 12 years ago

It's currently taking me around 2s to run the extraction on a single page.

Following is the output of line_profiler:

'''
Line #, Hits, Time, Per Hit, % Time, Line Contents

53                                           def extract(url, page, scraper):
54                                               """Returns a dictionary containing the extraction output
55                                               """
56        10         2923    292.3      0.1      page = unicode(page, errors = 'ignore')
57        10       704147  70414.7     17.8      html_page = HtmlPage(url, body=page, encoding = 'utf-8')
58                                           
59        10      2604545 260454.5     65.9      ex = InstanceBasedLearningExtractor(scraper.templates)
60        10       640413  64041.3     16.2      records = ex.extract(html_page)[0]
61        10          141     14.1      0.0      return records[0]

'''

Am I doing something wrong? The extraction code is similar to that found in tool.py and __init__.py, but I get faster extraction times when I run scrapely from the command line than when using the code above.

Please advise.

shaneaevans commented 12 years ago

Generally, the InstanceBasedLearningExtractor is initialized once with all templates that could be relevant, and is then used on many pages. Can you do this in your application, and does it improve performance for you?
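
A minimal sketch of that pattern (Python 2, to match the snippets in this thread; `scraper.templates` is taken from the original code, and `pages` is a hypothetical iterable of (url, raw_html) pairs):

'''
from scrapely.htmlpage import HtmlPage
from scrapely.extraction import InstanceBasedLearningExtractor

# Build the extractor once; the profile above shows its constructor
# dominating the cost (65.9% of the time), so it should not be rebuilt
# for every page.
extractor = InstanceBasedLearningExtractor(scraper.templates)

def extract(url, page, extractor):
    """Returns the first record extracted from a single page."""
    page = unicode(page, errors='ignore')
    html_page = HtmlPage(url, body=page, encoding='utf-8')
    records = extractor.extract(html_page)[0]
    return records[0]

for url, body in pages:
    print extract(url, body, extractor)
'''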

javadi82 commented 12 years ago

Initializing the extractor object just once, as recommended, has cut the extraction time in half. But the extraction still works out to roughly half a second per URL (running extraction on 1438 URLs takes 13 minutes). Is it possible to optimize the calls to the HtmlPage constructor?

Still, the time taken does not line up with what I'd expect based on scrapely.tool, which is ~1-2s.

Following is the profiler output after the recommended change:

'''
Line #, Hits, Time, Per Hit, % Time, Line Contents

67                                           @profile
68                                           def extract(url, page, extractor):
69                                               """Returns dictionary containing the extraction output
70                                               """
71      5741      5947200   1035.9      0.4      page = unicode(page, errors = 'ignore')
72      5741   1030895392 179567.2     75.4      html_page = HtmlPage(url, body=page, encoding = 'utf-8')
73                                           
74      5741    330400129  57551.0     24.2      records = extractor.extract(html_page)[0]
75      5741        31588      5.5      0.0      return records[0]

'''

shaneaevans commented 12 years ago

On my laptop, this simple example (URLs taken from your other ticket) managed to process 100 pages per second, which is very far from the speed you are reporting. To go beyond that rate, it's usually easiest to run multiple processes and split the work between them.
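
For the multi-process split, a hedged sketch using the standard library's multiprocessing module (helper names are hypothetical; it assumes the page bodies are already downloaded and that the templates can be pickled and sent to the workers):

'''
from multiprocessing import Pool
from scrapely.extraction import InstanceBasedLearningExtractor

_ex = None

def _init_worker(templates):
    # Build one extractor per worker process, not per page.
    global _ex
    _ex = InstanceBasedLearningExtractor(templates)

def _extract_one(args):
    url, body = args
    return extract(url, body, _ex)  # extract() as defined earlier

if __name__ == '__main__':
    pool = Pool(processes=4, initializer=_init_worker,
                initargs=(scraper.templates,))
    results = pool.map(_extract_one, pages)  # pages: list of (url, raw_html)
'''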

Are you sure there isn't some mistake, such as also counting the page download time and assuming that extraction is the bottleneck (i.e. your application is IO-bound downloading pages while you're looking at where the CPU time is spent), or perhaps measuring the time while running with the profiler enabled?
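
One quick way to separate the two (a sketch; it assumes the pages were saved to disk beforehand, so the loop measures extraction only, with no network IO and no profiler overhead; `saved_pages` is a hypothetical list of (url, local_file_path) pairs):

'''
import time

# Pre-load all page bodies so disk/network time is excluded from the timing.
bodies = [(url, open(path).read()) for url, path in saved_pages]

start = time.time()
for url, body in bodies:
    extract(url, body, extractor)  # extraction only
elapsed = time.time() - start
print '%d pages in %.1fs (%.1f pages/s)' % (
    len(bodies), elapsed, len(bodies) / elapsed)
'''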

If you can create a test that reproduces this behavior, it would be easier to help. Some details about your platform would also be useful.

shaneaevans commented 12 years ago

Unable to reproduce and no more details have been forthcoming.