Parser for DR.dk

ndarville commented 10 years ago

[ ] Ensure all articles (and their sections) are captured
[ ] Working URL schemes
[ ] Working DOM scraper

ndarville commented 10 years ago

from baseparser import BaseParser
from BeautifulSoup import BeautifulSoup

class InformationParser(BaseParser):
    feeder_pat = r'^http://www\.dr\.dk/Nyheder/(Politik|Indland|Udland)/'
    feeder_pages = [
        'http://www.dr.dk/nyheder/allenyheder/indland',
        'http://www.dr.dk/nyheder/allenyheder/udland',
        'http://www.dr.dk/nyheder/allenyheder/politik'
    ]

    def _parse(self, html):
        """Retrieve and serve the required fields to create an entry."""
        soup = BeautifulSoup(html,
            convertEntities=BeautifulSoup.HTML_ENTITIES,
            fromEncoding='utf-8')

        self.meta = soup.findAll('meta')
        self.title = soup.find('h1').getText()
        self.date = soup.find('time', {'itemprop': 'datetime'}).getText()
        self.byline = soup.find('span', 'author-name').next.getText()

        # body_container = soup.find('div', 'wcms-article-content')
        # body_summary = body_container.find('p', 'summary').getText()
        self.body = ""
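
For the "working URL schemes" item, a quick check of the feeder pattern against a few URLs can help. The sketch below is only an illustration: the sample URLs are made up, and `matching_urls` is a hypothetical helper, not part of newsdiffs. It uses the pattern from the parser with the dots escaped so they match literally. One thing it makes visible is that the pattern is case-sensitive (`Nyheder`), while the feeder pages are written in lowercase (`nyheder`), so lowercase article links would only match with `re.IGNORECASE`.

```python
import re

# Feeder pattern from the parser, with dots escaped so they match literally.
FEEDER_PAT = r'^http://www\.dr\.dk/Nyheder/(Politik|Indland|Udland)/'

# Hypothetical URLs for illustration only; real DR.dk article links may differ.
SAMPLE_URLS = [
    'http://www.dr.dk/Nyheder/Politik/example-article',
    'http://www.dr.dk/nyheder/politik/example-article',
    'http://www.dr.dk/Sporten/example-article',
]


def matching_urls(urls, pattern, flags=0):
    """Return the URLs that the feeder pattern accepts."""
    regex = re.compile(pattern, flags)
    return [url for url in urls if regex.match(url)]


# Only the capitalised URL passes the pattern as written.
case_sensitive = matching_urls(SAMPLE_URLS, FEEDER_PAT)

# With IGNORECASE the lowercase variant passes too; the sports URL never does.
case_insensitive = matching_urls(SAMPLE_URLS, FEEDER_PAT, re.IGNORECASE)
```

Whether DR.dk actually links to articles with the capitalised or the lowercase path is not settled in this thread, so treat the flag as something to verify against real feeder pages.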

ndarville commented 10 years ago

The articles don’t have a container, so FML.
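
One possible workaround when there is no wrapping container is to collect the `<p>` elements wherever they sit in the page. The sketch below is a rough idea under assumptions, not a tested DR.dk scraper: the sample markup is invented, `ParagraphCollector` and `extract_body` are hypothetical names, and it uses Python 3's standard `html.parser` rather than the BeautifulSoup 3 API in the parser above, so that it runs without dependencies.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element, regardless of its parent."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished paragraph strings
        self._in_p = False     # True while inside a <p>
        self._chunks = []      # text pieces of the current paragraph

    def _flush(self):
        # Join the pieces and collapse runs of whitespace.
        text = ' '.join(''.join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            if self._in_p:
                self._flush()  # previous <p> was never closed
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == 'p' and self._in_p:
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def extract_body(html):
    """Return all paragraph text, separated by blank lines."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return '\n\n'.join(collector.paragraphs)


# Invented markup: paragraphs sit next to unrelated elements, no container.
SAMPLE_HTML = """
<h1>Headline</h1>
<p class="summary">Short summary.</p>
<div class="ad">Advertisement</div>
<p>First paragraph with <a href="#">a link</a>.</p>
<p>Second paragraph.</p>
"""

body = extract_body(SAMPLE_HTML)
```

This grabs every `<p>` on the page, so on a real article it would likely also pick up footer or teaser text and need some filtering, for example by class name or by stopping at a known marker.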

ndarville / newsdiffs

Parser for DR.dk #2