kootenpv / sky

:sunrise: next generation web crawling using machine intelligence
BSD 3-Clause "New" or "Revised" License

How to configure "publish_date" #3

Closed wannaphong closed 8 years ago

wannaphong commented 8 years ago

Following https://github.com/kootenpv/sky/blob/master/sky/configs.py, I configured "publish_date":

    'overwrite_values_by_xpath': {
        'publish_date': "2014-05"},

But when I open a crawled file, 'publish_date' isn't "2014-05". Looking at the URL's data, the actual date is 2012-04-01. Thank you for the sky module.

kootenpv commented 8 years ago

'publish_date':

If you would like to hard-code this value, you should put quote characters around it, since sky evaluates the value as an XPath expression: publish_date = html_tree.xpath(config['overwrite_values_by_xpath']['publish_date']).

e.g.

        'publish_date': "'2014-05'"

Note the inner single quotes.
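To illustrate with plain lxml (a minimal sketch, independent of sky): without the inner quotes the value parses as XPath arithmetic, while with them it is an XPath string literal that comes back unchanged.

    from lxml import html

    tree = html.fromstring("<html><body><p>hi</p></body></html>")

    # Without inner quotes, XPath parses this as subtraction: 2014 - 5.
    print(tree.xpath("2014-05"))    # prints 2009.0

    # With inner quotes, it is an XPath string literal.
    print(tree.xpath("'2014-05'"))  # prints 2014-05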

kootenpv commented 8 years ago

@wannaphongcom Did this work as intended?

wannaphong commented 8 years ago

It doesn't work. My code:

    default = crawler_service.get_crawl_plugin('default')
    default.save_config(DEFAULT_CRAWL_CONFIG)
    news_config = {
        'seed_urls': ['http://www.bbc.com/news/business'],
        'overwrite_values_by_xpath': {
            'publish_date': "'2014-05'"},
        'index_required_regexps': ['news/business-'],
        'max_saved_responses': 20,
    }

Thank you.

kootenpv commented 8 years ago

@wannaphongcom Which backend (Cloudant, ElasticSearch, ZODB) do you use? What version of sky are you using?

wannaphong commented 8 years ago

I am using sky 0.0.193 and ZODB.

wannaphong commented 8 years ago

Some of the result:

{"summary": "", "scrape_date": "2015-12-02T21:12:58","publish_date": "2014-05", "url": "http://www.bbc.com/news/business-18854396",... }

The actual publish date of that URL isn't "2014-05". Thank you.

kootenpv commented 8 years ago

I think it is because it has not collected a good enough template set yet. When you try

'max_saved_responses': 100

it does stabilise:

 "publish_date": "2012-07-16",
 "url": "http://www.bbc.com/news/business-18854396"

When trying this, make sure to delete the ZODB database and start over :)
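For reference, your config with just the limit raised and the overwrite removed (a sketch, assuming the crawler is left to infer the date itself):

    news_config = {
        'seed_urls': ['http://www.bbc.com/news/business'],
        'index_required_regexps': ['news/business-'],
        'max_saved_responses': 100,
    }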

Does it work like that for you?

wannaphong commented 8 years ago

It gives an error. My code:

    default = crawler_service.get_crawl_plugin('default')
    default.save_config(DEFAULT_CRAWL_CONFIG)
    news_config = {
        'seed_urls': ['http://www.bbc.com/news/business'],
        'overwrite_values_by_xpath': {
            "publish_date": "2012-07-16",
            "url": "http://www.bbc.com/news/business-18854396"},
        'index_required_regexps': ['news/business-'],
        'max_saved_responses': 20,
    }
    news = crawler_service.get_crawl_plugin('bbc')
    news.save_config(news_config)
    crawler_service.run('bbc')

Result:

    INFO:sky.crawler.crawling:Queue: 395, FOUND ~2 visitable urls from 'http://www.bbc.com/news/business-26169116'
    ERROR:sky.crawler.crawling:CRITICAL ERROR IN SCRAPER for url 'http://www.bbc.com/news/business-26133269': 'Invalid expression', stack 'Traceback (most recent call last):
      File "C:\py34\lib\site-packages\sky\crawler\crawling.py", line 494, in save_response
        self.data[url] = self.scraper.process(url, tree, False, ['cleaned'])
      File "C:\py34\lib\site-packages\sky\scraper.py", line 279, in process
        new = tree.xpath(v)
      File "src\lxml\lxml.etree.pyx", line 1587, in lxml.etree._Element.xpath (src\lxml\lxml.etree.c:58124)
      File "src\lxml\xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src\lxml\lxml.etree.c:167145)
      File "src\lxml\xpath.pxi", line 227, in lxml.etree._XPathEvaluatorBase._handle_result (src\lxml\lxml.etree.c:166104)
    lxml.etree.XPathEvalError: Invalid expression'

kootenpv commented 8 years ago

Why do you want to overwrite publish_date? It should not be necessary.

If you really want to force the publish_date and url, then:

    'overwrite_values_by_xpath': {
        "publish_date": "'2012-07-16'",
        "url": "'http://www.bbc.com/news/business-18854396'"},
wannaphong commented 8 years ago

Thank you.