gyachdav / awoiaf

Extracts data from A Wiki of Ice and Fire
http://awoiaf.westeros.org

Overview

This project contains the code base to extract data from the wiki portal http://awoiaf.westeros.org.

Repo structure

 |
 |-src - main code base (python)
 |--lib - application modules
 |--sge - scripts to run jobs in parallel on a compute cluster
 |-Data - downloaded data

Quick Start

Prerequisites

Easy setup

To download the dependencies and set the correct PYTHONPATH, run . ./build.sh from the root directory of this repository. NB: the leading . is necessary to export the PYTHONPATH into your current bash session.

Important: you must export the PYTHONPATH every time you wish to run these tools. To do so, you can either run the build script each time or follow the configuration section below.
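For example, a new shell session would typically start with the following (a minimal sketch, assuming this repository was cloned to /path/to/awoiaf):

$ cd /path/to/awoiaf
$ . ./build.sh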

Alternatively, follow these steps:

Dependencies

Next, execute:

$ python -m nltk.downloader punkt averaged_perceptron_tagger

NOTE: you may need to set up PYTHONPATH to include the path of the installed modules if they were installed into non-default locations (for instance, if you installed them into your user space).
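For example, if the dependencies were installed into a custom directory with pip's --target option (the directory below is just a placeholder), you would add that directory to PYTHONPATH as well:

$ pip install --target=/path/to/deps nltk
$ export PYTHONPATH="${PYTHONPATH}:/path/to/deps"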

Configuration

You will need to set up the PYTHONPATH to reference the lib folder:

# in bash
AWOIAF_ROOT=/path/to/awoiaf/
export PYTHONPATH="${PYTHONPATH}:${AWOIAF_ROOT}/src/lib"
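As a quick sanity check (assuming the repository directory is named awoiaf, as in the example above), you can confirm the lib folder is on Python's module search path:

$ python -c "import sys; print('\n'.join(sys.path))" | grep awoiaf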

Hint

The scripts in the src folder are the main drivers that build the data repository; you can treat them as entry points into the code. Look at these scripts to see how the modules in this application can be used.
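As a sketch only (the script name below is a hypothetical placeholder; substitute one of the actual driver scripts in src), running a driver with the PYTHONPATH configured as above looks like:

$ python ${AWOIAF_ROOT}/src/some_driver_script.py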

Running multiple jobs in parallel

As this project deals with processing thousands of wiki pages, it makes sense to use parallel processing to speed things up. If you have access to a compute cluster running the Son of Grid Engine scheduler (a fork of Sun Grid Engine, SGE), check the src/sge folder for scripts and documentation on how to run parallel jobs. If you want to schedule jobs using a different system (e.g. Hadoop YARN), you will have to figure that out yourself.
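As an illustration only (the real submission scripts and documentation live in src/sge; the script and file names below are hypothetical placeholders), an SGE array job that processes one wiki page per task might look like this:

# process_pages.sh -- hypothetical SGE array-job script
#$ -S /bin/bash
#$ -cwd
#$ -N awoiaf_pages
# SGE sets SGE_TASK_ID for each task in the array;
# here each task reads one page name from a (hypothetical) pages.txt
PAGE=$(sed -n "${SGE_TASK_ID}p" pages.txt)
python src/some_page_processor.py "$PAGE"

You would then submit it as an array of, say, 1000 tasks with:

$ qsub -t 1-1000 process_pages.sh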