A scrapy project to extract the text and metadata of articles from news websites.
This should provide much of the structure and parsing code needed to fetch from arbitrary news websites. It may work out-of-the-box on some of the sites with specific spiders already written (see below), but be aware that web scrapers are by their nature somewhat brittle: they depend on the underlying format and structure of each site's pages, and when these change they tend to break. Although RISJbot has a fallback scraper that does a reasonable job with arbitrary news pages, it's not a substitute for a hand-tailored spider.
Having some degree of experience with Python would be very helpful. If sites update their templates or you want to add a new site to the collection then some coding will be necessary. I've tried to ensure that the existing code is well commented. The Scrapy docs are themselves quite good if you find yourself needing to understand what is going on behind the scenes.
You should be aware that this was written to support the author's academic research into online news. It is still actively (if slowly) developed for that purpose, but it is not production-level code and comes with even fewer guarantees than most Free software.
This is a Scrapy project, so first you need a working Scrapy installation: https://docs.scrapy.org/en/latest/intro/install.html
The second thing to do is to clone RISJbot and edit settings.py to set things up how you want them. The example settings file does most things sensibly. Make sure you set up a sensible place for the crawl data to be put.
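For instance, sending output to S3 uses Scrapy's standard feed-export settings. A minimal sketch is below; the bucket name and credentials are placeholders, and the exact keys RISJbot expects are documented in the comments of its own settings.py, so check there rather than copying this verbatim:

```python
# settings.py (fragment) — standard Scrapy feed-export settings.
# The bucket and credentials below are placeholders, not real values.
FEED_FORMAT = 'jsonlines'
FEED_URI = 's3://my-news-crawls/%(name)s/%(time)s.jsonl'  # hypothetical bucket
AWS_ACCESS_KEY_ID = 'changeme'
AWS_SECRET_ACCESS_KEY = 'changeme'
```

If no S3 credentials are given, output falls back to a local directory, as described below.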
Third, customise a crawler for the site you want to scrape. There are some working examples in the project: the 'guardian' crawler, for instance, fetches articles from Britain's The Guardian.
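The per-site work is mostly extraction logic: finding where a given site puts its headline, body text and metadata. The real spiders do this with Scrapy's CSS/XPath selectors; purely as an illustration of the idea, here is a standard-library-only sketch (the tag structure and class name are hypothetical, not taken from any RISJbot spider):

```python
from html.parser import HTMLParser

# Hypothetical, stdlib-only illustration of per-site extraction.
# Real RISJbot spiders use Scrapy selectors instead of HTMLParser.
class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.headline = ''
        self.paragraphs = []
        self._in_h1 = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self._in_h1 = True
        elif tag == 'p':
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == 'h1':
            self._in_h1 = False
        elif tag == 'p':
            self._in_p = False

    def handle_data(self, data):
        # Accumulate text seen inside the tags we care about.
        if self._in_h1:
            self.headline += data
        elif self._in_p:
            self.paragraphs.append(data)

sample = '<html><body><h1>Headline</h1><p>First par.</p></body></html>'
extractor = ArticleExtractor()
extractor.feed(sample)
```

When a site redesigns its pages, it is exactly this kind of mapping from markup to fields that needs updating.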
Fourth, use pip to install the dependencies in requirements.txt (currently specific versions of dateparser, extruct, textblob, pronouncing, scrapy-dotpersistence, scrapy-splash and readability-lxml).
The final thing is to run the crawler: scrapy crawl guardian will fetch its way through The Guardian. At this point it's basically an ordinary Scrapy installation, and the regular Scrapy docs should see you through. Output will be sent in JSONLines format to the S3 bucket you configured in settings.py, or (if you haven't given credentials) to a jsonloutput directory in the current directory.
If you want to load your output data into an R-based analysis framework via the tm package, there is a companion package, tm.plugin.risjbot, which does this easily. From there it's straightforward to convert it for use with quanteda, a more modern (and more actively maintained) R-based ecosystem. JSONLines libraries are also readily available for other programming languages and workflows.
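In Python, no extra library is even needed: each line of the output is one JSON object, so the standard library suffices. A minimal sketch (the field names in RISJbot's items are not shown here; each record is simply whatever the spider emitted):

```python
import json

def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONLines file."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

This streams the file a record at a time, which matters for large crawls that won't fit in memory as a single JSON array.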
This project contains a number of scrapy spiders to extract data from specific US and UK websites:
(Some spiders fetch pages via Splash, and need a SPLASH_URL configured in settings.py. See spiders/base/vice.py for details.)

Page formats change, so not all of these spiders may be currently operational. RISJbot now has a fallback text extractor using the readability library, which may help a bit. Nevertheless, pull requests to fix spider brokenness are most welcome.
Do also be aware that geographical issues can arise. USA Today, for example, serves a different site to users geolocated in the EU, which means that a working crawler can stop working when your computer moves.
The source of URLs to crawl is generally either a public RSS feed of new articles, or the sitemaps published to alert Google News of the articles available. You may be able to find suitable feeds through the feed_seeker package.
As an alternative, it's possible to crawl a specified list of URLs from a file. This is implemented in the NewsSpecifiedSpider class; see spiders/uk/guardian.py for a working example.
A spider class is also available for doing a link-following crawl via Splash (a headless browser which allows JavaScript-heavy pages to be properly handled).
In addition to the spiders, there are a number of interesting new pieces of middleware and extensions which expand crawling possibilities for this and other projects:
An extension for projects hosted on ScrapingHub, using a hacky subclassing of DotScrapyPersistence to allow persistent content to be stored in an arbitrary S3 bucket rather than in ScrapingHub's own.
This is a spider middleware to ignore requests to pages containing items seen in previous crawls of the same spider. It is a modified version of http://github.com/scrapy-deltafetch/DeltaFetch v1.2.1.
RefetchControl differs from the parent DeltaFetch by offering more general control over repeated fetching:
Depends on sqlite3 instead of bsddb3.
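Like any Scrapy spider middleware, RefetchControl is switched on through the standard SPIDER_MIDDLEWARES setting. A sketch of what that looks like; the dotted module path here is a guess for illustration only, so consult the project's settings.py for the real path and for RefetchControl's own configuration options:

```python
# settings.py (fragment) — standard Scrapy mechanism for enabling a
# spider middleware. The dotted path below is hypothetical.
SPIDER_MIDDLEWARES = {
    'RISJbot.spidermiddlewares.refetchcontrol.RefetchControl': 800,
}
```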
Spider middleware to coerce sets of equivalent domains to a single canonical location. This can deal with situations like http://editions.cnn.com and http://www.cnn.com, which deliver identical content. Should be put early in the chain.
Spider middleware to extract JSON-LD blocks and save their data into the Response's meta tag, so that they survive later processing intact.