This repo runs on Python 2.7.8. You will need apt-get, brew, or an equivalent package manager (these should already be available on Linux and Mac systems). Then install virtualenv:

pip install virtualenv
To parse XML articles, you'll need two system packages, libxml2 and libxslt. On Ubuntu, install them with:

sudo apt-get install libxml2 libxslt1-dev
Then, install the required python libraries with:
pip install -r requirements.txt
Windows note: the command above will only partially succeed and will error on libxml. You must download and install it manually.
Optionally, install genghisapp, which is like PHPMyAdmin for MongoDB:

gem install genghisapp

genghisapp requires Ruby and RubyGems. You can install Ruby by following this guide and RubyGems by downloading and installing from here.

Once everything is installed, you can run the crawler with:
python main.py configs/simple-config.json
This will run the crawler with the simplest possible setup. It will crawl articles from the main CNN RSS feed and write them to a directory as JSON files.
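As a sketch of working with the output, the snippet below loads every crawled article from an output directory. The field names ("title", "url") and the helper name `load_articles` are assumptions for illustration; inspect your own output files for the actual schema.

```python
import json
import os

def load_articles(output_dir):
    """Load every crawled article JSON file from output_dir.

    Field names inside each article (e.g. "title", "url") are
    assumptions -- check your actual output files for the real schema.
    """
    articles = []
    for name in sorted(os.listdir(output_dir)):
        if name.endswith(".json"):
            with open(os.path.join(output_dir, name)) as f:
                articles.append(json.load(f))
    return articles
```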
To configure different behavior, you can specify a different configuration file. There are several pre-built configuration files in the configs/
directory. If none of them do what you want, consider making a new configuration. See configs/configuration.md
for more details.
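To give a feel for the shape of a configuration, a minimal file might look like the sketch below. Every key here is illustrative only, not the real option names; configs/configuration.md documents the actual format.

```json
{
  "feeds": ["http://rss.cnn.com/rss/cnn_topstories.rss"],
  "output": {
    "type": "directory",
    "path": "output/"
  }
}
```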
Alternatively, you can develop inside a Vagrant virtual machine:

vagrant up
vagrant ssh
The project directory is by default mapped to /vagrant in the virtual machine.
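For reference, a minimal Vagrantfile that produces this mapping would look roughly like the following. The box name is an assumption; the project's actual Vagrantfile may differ.

```ruby
# Sketch only: box name and provisioning are assumptions.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"
  # Sync the project directory to /vagrant inside the VM
  # (this is also Vagrant's default synced folder).
  config.vm.synced_folder ".", "/vagrant"
end
```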