Closed wannaphong closed 8 years ago
'publish_date':
If you would like to hard-code this value, you should put string characters around it, since it will be evaluated with publish_date = html_tree.xpath(config['overwrite_values_by_xpath']['publish_date']).
e.g.
'publish_date': "'2014-05'"
Note the single quotes.
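The difference is easy to check with lxml directly (a minimal sketch, independent of sky): a quoted value is an XPath string literal, while an unquoted 2014-05 is parsed as arithmetic.

```python
from lxml import html

# Minimal tree; any document works, since the expressions below
# don't actually select nodes.
tree = html.fromstring("<html><body><p>demo</p></body></html>")

# With inner quotes the value is an XPath string literal and is
# returned unchanged:
print(tree.xpath("'2014-05'"))   # 2014-05

# Without quotes, XPath parses 2014-05 as the subtraction 2014 - 5:
print(tree.xpath("2014-05"))     # 2009.0
```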
@wannaphongcom Did this work as intended?
It doesn't work. My code:
default = crawler_service.get_crawl_plugin('default')
default.save_config(DEFAULT_CRAWL_CONFIG)
news_config = {
    'seed_urls': ['http://www.bbc.com/news/business'],
    'overwrite_values_by_xpath': {
        'publish_date': "'2014-05'"},
    'index_required_regexps': ['news/business-'],
    'max_saved_responses': 20,
}
Thank you.
@wannaphongcom Which backend (Cloudant, ElasticSearch, ZODB) do you use? What version of sky are you using?
I'm using sky 0.0.193 and ZODB.
Part of the result:
{"summary": "", "scrape_date": "2015-12-02T21:12:58","publish_date": "2014-05", "url": "http://www.bbc.com/news/business-18854396",... }
The real publish date of that url isn't "2014-05". Thank you.
I think it is because it did not get a good enough template set yet. When you try
'max_saved_responses': 100
it does stabilise:
"publish_date": "2012-07-16",
"url": "http://www.bbc.com/news/business-18854396"
When trying this, make sure to delete the ZODB database and start over :)
Does it work like that for you?
It gives an error. My code:
default = crawler_service.get_crawl_plugin('default')
default.save_config(DEFAULT_CRAWL_CONFIG)
news_config = {
    'seed_urls': ['http://www.bbc.com/news/business'],
    'overwrite_values_by_xpath': {
        'publish_date': "2012-07-16",
        'url': "http://www.bbc.com/news/business-18854396"},
    'index_required_regexps': ['news/business-'],
    'max_saved_responses': 20,
}
news = crawler_service.get_crawl_plugin('bbc')
news.save_config(news_config)
crawler_service.run('bbc')
Result:
INFO:sky.crawler.crawling:Queue: 395, FOUND ~2 visitable urls from 'http://www.bbc.com/news/business-26169116'
ERROR:sky.crawler.crawling:CRITICAL ERROR IN SCRAPER for url 'http://www.bbc.com/news/business-26133269': 'Invalid expression', stack 'Traceback (most recent call last):
  File "C:\py34\lib\site-packages\sky\crawler\crawling.py", line 494, in save_response
    self.data[url] = self.scraper.process(url, tree, False, ['cleaned'])
  File "C:\py34\lib\site-packages\sky\scraper.py", line 279, in process
    new = tree.xpath(v)
  File "src\lxml\lxml.etree.pyx", line 1587, in lxml.etree._Element.xpath (src\lxml\lxml.etree.c:58124)
  File "src\lxml\xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src\lxml\lxml.etree.c:167145)
  File "src\lxml\xpath.pxi", line 227, in lxml.etree._XPathEvaluatorBase._handle_result (src\lxml\lxml.etree.c:166104)
lxml.etree.XPathEvalError: Invalid expression'
Why do you want to overwrite publish_date? It should not be necessary.
If you really want to force the publish_date and url, then:
'overwrite_values_by_xpath': {
    "publish_date": "'2012-07-16'",
    "url": "'http://www.bbc.com/news/business-18854396'"},
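sky hands these values straight to lxml's XPath evaluator, which is where the 'Invalid expression' in the traceback above comes from: a bare URL is not a valid XPath expression, while a quoted one is a string literal. A minimal repro outside of sky:

```python
from lxml import etree, html

tree = html.fromstring("<html><body><p>demo</p></body></html>")

# Quoted: an XPath string literal, returned as-is.
print(tree.xpath("'http://www.bbc.com/news/business-18854396'"))

# Unquoted: lxml cannot parse the bare URL as an XPath expression.
try:
    tree.xpath("http://www.bbc.com/news/business-18854396")
except etree.XPathEvalError as exc:
    print(exc)  # Invalid expression
```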
Thank you.
From https://github.com/kootenpv/sky/blob/master/sky/configs.py I configured "publish_date", but when I open a crawl file, 'publish_date' isn't "2014-05". In the url data, the date is 2012-04-01. Thanks for the sky module. Thank you.