
RISJbot

A scrapy project to extract the text and metadata of articles from news websites.

This should provide much of the structure and parsing code needed to fetch from arbitrary news websites. It may work out of the box on some of the sites with specific spiders already written (see below), but be aware that web scrapers are by their nature somewhat brittle: they depend on the underlying format and structure of each site's pages, and when these change they tend to break. Although RISJbot has a fallback scraper that does a reasonable job with arbitrary news pages, it's not a substitute for a hand-tailored spider.

Having some degree of experience with Python would be very helpful. If sites update their templates or you want to add a new site to the collection then some coding will be necessary. I've tried to ensure that the existing code is well commented. The Scrapy docs are themselves quite good if you find yourself needing to understand what is going on behind the scenes.

You should be aware that this was written to support the author's academic research into online news. It is still actively (if slowly) developed for that purpose, but it is not production-level code and comes with even fewer guarantees than most Free software.

Installation

This is a Scrapy project, so first you need a working Scrapy installation: https://docs.scrapy.org/en/latest/intro/install.html

The second thing to do is to clone RISJbot and edit settings.py to set things up how you want them. The example settings file has sensible defaults; make sure you configure a suitable place for the crawl data to be put.
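As an illustration, a minimal output configuration might use Scrapy's standard FEEDS setting, as in the sketch below. RISJbot's shipped settings.py organises its output (and S3 credentials) in its own way, so treat the keys here as an example of the idea rather than the project's actual option names.

```python
# settings.py sketch: write one JSONLines file per spider into ./jsonloutput/.
# FEEDS is standard Scrapy; check RISJbot's own settings.py for the options it really uses.
FEEDS = {
    'jsonloutput/%(name)s-%(time)s.jl': {
        'format': 'jsonlines',
        'encoding': 'utf8',
    },
}

# For S3 output instead, supply credentials and an s3:// feed URI, e.g.:
# AWS_ACCESS_KEY_ID = 'your-key-id'
# AWS_SECRET_ACCESS_KEY = 'your-secret-key'
# FEEDS = {'s3://your-bucket/%(name)s-%(time)s.jl': {'format': 'jsonlines'}}
```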

Third, customise a crawler for the site you want to scrape. There are some working examples in the project: the 'guardian' crawler, for instance, fetches articles from Britain's The Guardian.
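As a rough sketch of the shape a site-specific spider takes, the following uses plain Scrapy primitives with an invented site and invented selectors. RISJbot's real spiders subclass the project's own base classes, so copy the structure of an existing spider such as spiders/uk/guardian.py rather than this sketch when adding a site.

```python
# Rough, hypothetical sketch only: a sitemap-driven news spider in plain Scrapy.
# RISJbot's shipped spiders use the project's own base classes instead.
from scrapy.spiders import SitemapSpider


class ExampleNewsSpider(SitemapSpider):
    name = 'examplenews'  # invented spider name
    sitemap_urls = ['https://www.example.com/sitemap-news.xml']  # invented URL

    def parse(self, response):
        # Selectors are illustrative; real sites need hand-tailored ones.
        yield {
            'url': response.url,
            'headline': response.css('h1::text').get(),
            'bodytext': ' '.join(response.css('article p::text').getall()),
        }
```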

Fourth, use pip to install the dependencies in requirements.txt (currently specific versions of dateparser, extruct, textblob, pronouncing, scrapy-dotpersistence, scrapy-splash and readability-lxml).

The final thing is to run the crawler: scrapy crawl guardian will fetch its way through The Guardian. But at this point it's basically an ordinary Scrapy installation, and the regular Scrapy docs should see you through.
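If you prefer to launch crawls from Python rather than the shell, plain Scrapy offers a programmatic equivalent; nothing here is RISJbot-specific:

```python
# run_crawl.py: programmatic equivalent of `scrapy crawl guardian`.
# Run from inside the project directory so settings.py is picked up.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('guardian')   # spider name as registered in the project
process.start()             # blocks until the crawl finishes
```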

Output will be sent in JSONLines format to the S3 bucket you configured in settings.py, or (if you haven't given credentials) to a jsonloutput directory in the current directory.
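Because JSONLines is just one JSON object per line, the output is easy to post-process in any language. A minimal Python reader might look like this (the filename is illustrative):

```python
import json

# Each line of a .jl file is an independent JSON object (one scraped article).
with open('jsonloutput/guardian.jl', encoding='utf-8') as f:
    articles = [json.loads(line) for line in f if line.strip()]

print(len(articles), 'articles loaded')
```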

If you want to load your output data into an R-based analysis framework via the tm package, there is a companion package tm.plugin.risjbot which does this easily. From there it's straightforward to convert it for use with quanteda, a more modern (and more actively maintained) R-based ecosystem. JSONLines libraries are also readily available for other programming languages and workflows.

Spiders

This project contains a number of Scrapy spiders to extract data from specific US and UK news websites; see the spiders/ directory for the full set.

Page formats change, so not all of these spiders may be currently operational. RISJbot now has a fallback text extractor using the readability library which may help a bit. Nevertheless, pull requests to fix spider brokenness are most welcome.
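For reference, the readability-lxml library behind the fallback extractor can also be used on its own, roughly as below. RISJbot wires it in as middleware, so you don't normally call it directly (requests is used here purely for illustration and isn't one of RISJbot's dependencies):

```python
# Standalone illustration of readability-lxml, the library behind the fallback extractor.
import requests
from readability import Document

html = requests.get('https://www.example.com/some-article').text
doc = Document(html)
print(doc.title())    # best-guess article title
print(doc.summary())  # cleaned article HTML with boilerplate stripped
```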

Also be aware that you can run into geographical issues. USA Today, for example, serves a different site to users geolocated in the EU, which means that a working crawler can stop working when your computer moves.

The source of URLs to crawl is generally either a public RSS feed of new articles, or the sitemaps published to alert Google News of the articles available. You may be able to find suitable feeds through the feed_seeker package.
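As a sketch of feed discovery with feed_seeker (the function name follows that package's documentation; verify it against the installed version):

```python
# Sketch: discover candidate RSS/Atom feeds for a news site with feed_seeker.
from feed_seeker import generate_feed_urls

for feed_url in generate_feed_urls('https://www.theguardian.com/'):
    print(feed_url)
```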

As an alternative, it's possible to crawl a specified list of URLs from a file. This is implemented in the NewsSpecifiedSpider class; see spiders/uk/guardian.py for a working example.

A spider class is also available for doing a link-following crawl via Splash (a headless browser which allows JavaScript-heavy pages to be properly handled).
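A Splash-rendered, link-following spider built on the scrapy-splash package looks roughly like the sketch below. The spider, site and selectors are invented, and you need a running Splash instance plus the scrapy-splash middlewares and SPLASH_URL configured in settings.py:

```python
# Illustrative Splash-rendered spider using scrapy-splash; not one of RISJbot's shipped spiders.
import scrapy
from scrapy_splash import SplashRequest


class JsSiteSpider(scrapy.Spider):
    name = 'jssite_example'  # invented spider name
    start_urls = ['https://www.example.com/']

    def start_requests(self):
        for url in self.start_urls:
            # Render the page in Splash so JavaScript-built content is present.
            yield SplashRequest(url, self.parse, args={'wait': 2.0})

    def parse(self, response):
        # Follow in-site links, also rendering them via Splash.
        for href in response.css('a::attr(href)').getall():
            yield SplashRequest(response.urljoin(href), self.parse_article,
                                args={'wait': 2.0})

    def parse_article(self, response):
        yield {'url': response.url, 'headline': response.css('h1::text').get()}
```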

Middlewares and extensions

In addition to the spiders, there are a number of interesting new pieces of middleware and extensions which expand crawling possibilities for this and other projects:

FlexibleDotScrapyPersistence

An extension for projects hosted on Scrapinghub, using a hacky subclassing of DotScrapyPersistence to allow persistent content to be stored in an arbitrary S3 bucket rather than in Scrapinghub's own.

RefetchControl

This is a spider middleware to ignore requests to pages containing items seen in previous crawls of the same spider. It is a modified version of http://github.com/scrapy-deltafetch/DeltaFetch v1.2.1.

RefetchControl differs from its parent DeltaFetch by offering more general control over repeated fetching, and it depends on sqlite3 instead of bsddb3.
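Like DeltaFetch, it is enabled as a spider middleware in settings.py. The import path and option name in the sketch below are placeholders rather than the middleware's documented interface; check the RISJbot source and example settings.py for the real ones:

```python
# settings.py sketch: enabling RefetchControl as a spider middleware.
# Both the dotted path and the option name below are placeholders -- look up
# the actual values in the RISJbot source.
SPIDER_MIDDLEWARES = {
    'RISJbot.spidermiddlewares.refetchcontrol.RefetchControl': 550,
}
REFETCHCONTROL_ENABLED = True
```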

EquivalentDomains

Spider middleware to coerce sets of equivalent domains to a single canonical location. This can deal with situations like http://editions.cnn.com and http://www.cnn.com, which deliver identical content. Should be put early in the chain.
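Configuration amounts to a mapping from equivalent hostnames to the canonical one; the setting name below is a placeholder, so check the middleware source for the name it actually reads:

```python
# settings.py sketch for EquivalentDomains; EQUIVALENT_DOMAINS is a placeholder name.
EQUIVALENT_DOMAINS = {
    'editions.cnn.com': 'www.cnn.com',  # coerce both hosts to one canonical domain
}
```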

ExtractJSONLD

Spider middleware to extract JSON-LD blocks and save their data into the Response's meta tag. This stops them being squelched before the spider can make use of them.
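Downstream spider code can then read the parsed JSON-LD out of response.meta; the key name used below is a placeholder, so check the middleware source for the one it actually sets:

```python
# Inside a spider callback: read JSON-LD data stashed in response.meta by ExtractJSONLD.
import scrapy


class JsonLdAwareSpider(scrapy.Spider):
    name = 'jsonld_example'  # invented spider name

    def parse(self, response):
        # 'json-ld' is a placeholder key; check ExtractJSONLD for the real one.
        for block in response.meta.get('json-ld', []):
            if block.get('@type') == 'NewsArticle':
                yield {
                    'url': response.url,
                    'headline': block.get('headline'),
                    'datePublished': block.get('datePublished'),
                }
```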
