holgerd77 / django-dynamic-scraper

Creating Scrapy scrapers via the Django admin interface
http://django-dynamic-scraper.readthedocs.io
BSD 3-Clause "New" or "Revised" License

ERROR: Mandatory elem title missing! #26

Closed: natea closed this issue 11 years ago

natea commented 11 years ago

When I try to run the scraper on the example from the docs, it seems to get stuck scraping an item, and never saves the item to the Django database:

scrapy crawl article_spider -a id=1 -a do_action=yes
2013-09-11 20:18:25-0700 [scrapy] INFO: Scrapy 0.18.2 started (bot: open_news)
2013-09-11 20:18:25-0700 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django
2013-09-11 20:18:25-0700 [scrapy] DEBUG: Overridden settings: {'SPIDER_MODULES': ['dynamic_scraper.spiders', 'open_news.scraper'], 'ITEM_PIPELINES': ['dynamic_scraper.pipelines.ValidationPipeline', 'open_news.scraper.pipelines.DjangoWriterPipeline'], 'USER_AGENT': 'open_news/1.0', 'BOT_NAME': 'open_news'}
2013-09-11 20:18:26-0700 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-09-11 20:18:26-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-09-11 20:18:26-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-09-11 20:18:26-0700 [scrapy] DEBUG: Enabled item pipelines: ValidationPipeline, DjangoWriterPipeline
2013-09-11 20:18:26-0700 [article_spider] INFO: Spider opened
2013-09-11 20:18:26-0700 [article_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-09-11 20:18:26-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-09-11 20:18:26-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-09-11 20:18:26-0700 [article_spider] DEBUG: Crawled (200) <GET http://en.wikinews.org/wiki/Main_Page> (referer: None)
2013-09-11 20:18:26-0700 [article_spider] INFO: Starting to crawl item 1.
2013-09-11 20:18:26-0700 [article_spider] DEBUG: description         'The United States President Barack Obama announced last Saturday he was seeking Congressional authorisation for military intervention in Syria. Wikinews interviewed professors Scott Lucas, Professor of American Studies from the UK's University of Birmingham; Majid Rafizadeh, the President of the International American Council on the Middle East; and ProfEyal Zisser, a Syrian expert from Tel Aviv University about the risks of military intervention in Syria.'
2013-09-11 20:18:26-0700 [article_spider] DEBUG: url                 'http://en.wikinews.org/wiki/Wikinews_interviews_Scott_Lucas,_Eyal_Zisser,_Majid_Rafizadeh_about_risks_of_US_military_intervention_in_Syria'
2013-09-11 20:18:26-0700 [article_spider] INFO: Starting to crawl item 2.
2013-09-11 20:18:26-0700 [article_spider] DEBUG: description         'attended the finals of the Women's National Wheelchair Basketball League at the Sydney University Sports and Aquatic Centre over the weekend.'
2013-09-11 20:18:26-0700 [article_spider] DEBUG: url                 'http://en.wikinews.org/wiki/Western_Stars_win_Women%27s_National_Wheelchair_Basketball_League_championship_in_a_thriller'
2013-09-11 20:18:26-0700 [article_spider] INFO: Starting to crawl item 3.
2013-09-11 20:18:26-0700 [article_spider] DEBUG: description         'Local residents of Dungog, New South Wales, held a celebratory nature walk after they received assurance that their local forest was deemed worthy of "enduring protection." A proposal before the NSW government to log over one million hectares of protected national park forests had caused alarm among nature conservationists.'
2013-09-11 20:18:26-0700 [article_spider] DEBUG: url                 'http://en.wikinews.org/wiki/Dungog,_Australia_residents_celebrate_continued_protection_of_local_forest'
2013-09-11 20:18:26-0700 [article_spider] INFO: Starting to crawl item 4.
2013-09-11 20:18:26-0700 [article_spider] DEBUG: description         'attended a roller derby event at the Caloundra Indoor Stadium on Australia's Sunshine Coast Saturday.'
2013-09-11 20:18:26-0700 [article_spider] DEBUG: url                 'http://en.wikinews.org/wiki/Coastal_Assassins_get_wins_at_roller_derby_event_on_Australia%27s_Sunshine_Coast'
2013-09-11 20:18:26-0700 [article_spider] INFO: Starting to crawl item 5.
2013-09-11 20:18:26-0700 [article_spider] DEBUG: description         'A group of volcanologists from the UK and USA have traveled to North Korea to assist them with conducting scientific investigations near the volcano Mount Paektu.'
2013-09-11 20:18:26-0700 [article_spider] DEBUG: url                 'http://en.wikinews.org/wiki/Wikinews_interviews_Dr._Robert_Kelly_regarding_joint_scientific_venture_in_North_Korea'
2013-09-11 20:18:27-0700 [article_spider] DEBUG: Crawled (200) <GET http://en.wikinews.org/wiki/Western_Stars_win_Women%27s_National_Wheelchair_Basketball_League_championship_in_a_thriller> (referer: http://en.wikinews.org/wiki/Main_Page)
2013-09-11 20:18:27-0700 [article_spider] DEBUG: title               None
2013-09-11 20:18:27-0700 [article_spider] ERROR: Mandatory elem title missing!
2013-09-11 20:18:27-0700 [article_spider] WARNING: Dropped: 
    {u'description': u"attended the finals of the Women's National Wheelchair Basketball League at the Sydney University Sports and Aquatic Centre over the weekend.",
     u'title': None,
     u'url': u'http://en.wikinews.org/wiki/Western_Stars_win_Women%27s_National_Wheelchair_Basketball_League_championship_in_a_thriller'}
2013-09-11 20:18:27-0700 [article_spider] DEBUG: Crawled (200) <GET http://en.wikinews.org/wiki/Coastal_Assassins_get_wins_at_roller_derby_event_on_Australia%27s_Sunshine_Coast> (referer: http://en.wikinews.org/wiki/Main_Page)
2013-09-11 20:18:27-0700 [article_spider] DEBUG: title               None
2013-09-11 20:18:27-0700 [article_spider] DEBUG: Crawled (200) <GET http://en.wikinews.org/wiki/Wikinews_interviews_Scott_Lucas,_Eyal_Zisser,_Majid_Rafizadeh_about_risks_of_US_military_intervention_in_Syria> (referer: http://en.wikinews.org/wiki/Main_Page)
2013-09-11 20:18:27-0700 [article_spider] DEBUG: Crawled (200) <GET http://en.wikinews.org/wiki/Dungog,_Australia_residents_celebrate_continued_protection_of_local_forest> (referer: http://en.wikinews.org/wiki/Main_Page)
2013-09-11 20:18:27-0700 [article_spider] ERROR: Mandatory elem title missing!
2013-09-11 20:18:27-0700 [article_spider] WARNING: Dropped: 
    {u'description': u"attended a roller derby event at the Caloundra Indoor Stadium on Australia's Sunshine Coast Saturday.",
     u'title': None,
     u'url': u'http://en.wikinews.org/wiki/Coastal_Assassins_get_wins_at_roller_derby_event_on_Australia%27s_Sunshine_Coast'}
2013-09-11 20:18:27-0700 [article_spider] DEBUG: title               None
2013-09-11 20:18:27-0700 [article_spider] DEBUG: title               None
2013-09-11 20:18:27-0700 [article_spider] DEBUG: Crawled (200) <GET http://en.wikinews.org/wiki/Wikinews_interviews_Dr._Robert_Kelly_regarding_joint_scientific_venture_in_North_Korea> (referer: http://en.wikinews.org/wiki/Main_Page)
2013-09-11 20:18:27-0700 [article_spider] ERROR: Mandatory elem title missing!
2013-09-11 20:18:27-0700 [article_spider] WARNING: Dropped: 
    {u'description': u"The United States President Barack Obama announced last Saturday he was seeking Congressional authorisation for military intervention in Syria. Wikinews interviewed professors Scott Lucas, Professor of American Studies from the UK's University of Birmingham; Majid Rafizadeh, the President of the International American Council on the Middle East; and ProfEyal Zisser, a Syrian expert from Tel Aviv University about the risks of military intervention in Syria.",
     u'title': None,
     u'url': u'http://en.wikinews.org/wiki/Wikinews_interviews_Scott_Lucas,_Eyal_Zisser,_Majid_Rafizadeh_about_risks_of_US_military_intervention_in_Syria'}
2013-09-11 20:18:27-0700 [article_spider] ERROR: Mandatory elem title missing!
2013-09-11 20:18:27-0700 [article_spider] WARNING: Dropped: 
    {u'description': u'Local residents of Dungog, New South Wales, held a celebratory nature walk after they received assurance that their local forest was deemed worthy of "enduring protection." A proposal before the NSW government to log over one million hectares of protected national park forests had caused alarm among nature conservationists.',
     u'title': None,
     u'url': u'http://en.wikinews.org/wiki/Dungog,_Australia_residents_celebrate_continued_protection_of_local_forest'}
2013-09-11 20:18:27-0700 [article_spider] DEBUG: title               None
2013-09-11 20:18:27-0700 [article_spider] ERROR: Mandatory elem title missing!
2013-09-11 20:18:27-0700 [article_spider] WARNING: Dropped: 
    {u'description': u'A group of volcanologists from the UK and USA have traveled to North Korea to assist them with conducting scientific investigations near the volcano Mount Paektu.',
     u'title': None,
     u'url': u'http://en.wikinews.org/wiki/Wikinews_interviews_Dr._Robert_Kelly_regarding_joint_scientific_venture_in_North_Korea'}
2013-09-11 20:18:27-0700 [article_spider] INFO: Closing spider (finished)
2013-09-11 20:18:27-0700 [article_spider] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 1921,
     'downloader/request_count': 6,
     'downloader/request_method_count/GET': 6,
     'downloader/response_bytes': 89787,
     'downloader/response_count': 6,
     'downloader/response_status_count/200': 6,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 9, 12, 3, 18, 27, 275875),
     'item_dropped_count': 5,
     'item_dropped_reasons_count/DropItem': 5,
     'log_count/DEBUG': 27,
     'log_count/ERROR': 5,
     'log_count/INFO': 8,
     'log_count/WARNING': 5,
     'request_depth_max': 1,
     'response_received_count': 6,
     'scheduler/dequeued': 6,
     'scheduler/dequeued/memory': 6,
     'scheduler/enqueued': 6,
     'scheduler/enqueued/memory': 6,
     'start_time': datetime.datetime(2013, 9, 12, 3, 18, 26, 203931)}
2013-09-11 20:18:27-0700 [article_spider] INFO: Spider closed (finished)

scott-coates commented 11 years ago

I'm guessing the title XPath is incorrect: either you didn't copy/paste it correctly, or the wiki site changed its DOM structure. Take the title XPath, go to a wiki page, and test it there to see whether it actually matches anything.

You can use the Chrome console to test. Something like this: $x("//div[@class='my_title']")
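
If you'd rather check outside the browser, here is a rough Python sketch of the same idea. It uses only the standard library's ElementTree, which supports just a limited XPath subset and needs well-formed markup, so treat it as an approximation of what Scrapy does. The page snippet, the class names and the xpath_matches helper are all made up for illustration. An empty result is the equivalent of the "title None" lines in the log.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a downloaded page.
PAGE = """
<html><body>
  <div class="my_title"><h1>Example headline</h1></div>
</body></html>
""".strip()


def xpath_matches(markup, xpath):
    """Return the elements that `xpath` selects in `markup`."""
    root = ET.fromstring(markup)
    return root.findall(xpath)


# An expression that fits the markup selects the heading element.
good = xpath_matches(PAGE, ".//div[@class='my_title']/h1")
# An expression with the wrong class selects nothing at all.
bad = xpath_matches(PAGE, ".//div[@class='wrong_class']/h1")

print([el.text for el in good])
print(len(bad))
```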

natea commented 11 years ago

Ahh.. thanks. I unchecked the box "From detail page" for the title (Article) object, and used this XPath:

span[@class="l_title"]/a/text()
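
For anyone else hitting this: with "From detail page" unchecked, the title XPath is evaluated relative to each base element on the main page instead of on the article page. A minimal sketch of that idea, using only the standard library's ElementTree rather than Scrapy itself. The listing markup, the td[@class='l_box'] base XPath and the titles_from_listing helper are assumptions for illustration, not taken from the actual page; ElementTree has no text() step, so the sketch reads .text from the matched link.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing markup; the real main page may look different.
LISTING = """
<table><tr>
  <td class="l_box"><span class="l_title"><a href="/wiki/A">First headline</a></span></td>
  <td class="l_box"><span class="l_title"><a href="/wiki/B">Second headline</a></span></td>
</tr></table>
""".strip()


def titles_from_listing(markup):
    """Collect one title per base element, or None where nothing matches."""
    root = ET.fromstring(markup)
    titles = []
    # Assumed base XPath: one element per article teaser on the listing page.
    for base in root.findall(".//td[@class='l_box']"):
        # Relative path, evaluated from the base element, not from the page root.
        link = base.find("span[@class='l_title']/a")
        titles.append(link.text if link is not None else None)
    return titles


print(titles_from_listing(LISTING))
```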