gt-big-data / retina-crawler

A news crawler for the Retina Project
This repo is written in Python 2.7.8. You will need to:

  1. Install Python 2.7.8 and add Python to your path (if you install with apt-get, brew, or an equivalent package manager on Linux or Mac systems, this should happen automatically).
  2. Install pip, the Python package manager.
  3. Optionally, install virtualenv by running pip install virtualenv


To parse XML articles, you'll need two system packages, libxml2 and libxslt. On Ubuntu, install them with sudo apt-get install libxml2 libxslt1-dev

Then install the required Python libraries with: pip install -r requirements.txt

Windows Note: The above command will only partially succeed on Windows; it will fail on the libxml dependency, which you must download and install manually.

Using Mongo

  1. Install MongoDB.
  2. Install genghisapp, which is like phpMyAdmin for MongoDB, with gem install genghisapp. genghisapp requires Ruby and RubyGems. You can install Ruby by following this guide and install gem by downloading and installing from here

Running everything

Once everything is installed, you can run the crawler with:

python configs/simple-config.json

This will run the crawler with the simplest possible setup. It will crawl articles from the main CNN RSS feed and write them to a directory as JSON files.
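To inspect what the simple setup produced, the JSON files can be loaded back into Python. The snippet below is a minimal sketch, not part of the crawler: the function name load_articles and the directory name "output" are hypothetical, and it assumes each article is written as one JSON document in a file ending in .json. Check the configuration file for the actual output directory.

```python
import glob
import io
import json
import os


def load_articles(directory):
    """Return the parsed JSON documents found in `directory`.

    Assumes one JSON document per *.json file (an assumption, not
    something the crawler guarantees).
    """
    articles = []
    for path in sorted(glob.glob(os.path.join(directory, "*.json"))):
        with io.open(path, encoding="utf-8") as handle:
            articles.append(json.load(handle))
    return articles


if __name__ == "__main__":
    # "output" is a placeholder; use the directory named in your config.
    articles = load_articles("output")
    print("Loaded %d article(s)" % len(articles))
```

The sketch only uses the standard library and should run on Python 2.7 as well as Python 3.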

To configure different behavior, you can specify a different configuration file. There are several pre-built configuration files in the configs/ directory. If none of them do what you want, consider making a new configuration. See configs/ for more details.
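When writing a new configuration, it can help to confirm that the file is well-formed JSON before handing it to the crawler. The helper below is a hypothetical sketch (load_config is not a function from this repo) and checks syntax only; it knows nothing about which keys the crawler expects.

```python
import io
import json


def load_config(path):
    """Parse a JSON configuration file, raising ValueError if it is malformed."""
    with io.open(path, encoding="utf-8") as handle:
        try:
            return json.load(handle)
        except ValueError as error:
            # json raises ValueError (or a subclass) on invalid syntax.
            raise ValueError("%s is not valid JSON: %s" % (path, error))


if __name__ == "__main__":
    config = load_config("configs/simple-config.json")
    print("Configuration parsed successfully")
```

A trailing comma or an unquoted key in a hand-edited file surfaces here as a ValueError naming the offending file.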

Running with Vagrant

The project directory is by default mapped to /vagrant in the virtual machine.