Apache Accumulo Wikipedia Search Example
========================================
This project contains a sample application for ingesting and querying Wikipedia data.
Though not strictly required, ingest will go more quickly if the Wikipedia dump files are decompressed, for example while loading them into HDFS:
$ bunzip2 < enwiki-*-pages-articles.xml.bz2 | hadoop fs -put - /wikipedia/enwiki-pages-articles.xml
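If the target HDFS directory does not already exist, create it before running the command above (the /wikipedia path is simply the directory used in this example):

$ hadoop fs -mkdir /wikipedia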
Configuration and Build
-----------------------
1. Copy ingest/conf/wikipedia.xml.example to ingest/conf/wikipedia.xml and edit it to specify your Accumulo connection information.
   (For parallel ingest, instead copy ingest/conf/wikipedia_parallel.xml.example to ingest/conf/wikipedia.xml)
2. Copy webapp/src/main/resources/app.properties.example to webapp/src/main/resources/app.properties and edit it in the same way.
3. From the wikisearch directory, run mvn package (a command sketch for these steps follows this list).
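A minimal sketch of these three steps, assuming you are working from the top-level wikisearch directory:

$ cp ingest/conf/wikipedia.xml.example ingest/conf/wikipedia.xml
$ cp webapp/src/main/resources/app.properties.example webapp/src/main/resources/app.properties
$ # edit both copied files to point at your Accumulo instance, then build
$ mvn package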
Ingest
------
1. Copy ingest/target/wikisearch-ingest-*.tar.gz to the cluster and untar it.
2. Copy lib/wikisearch-ingest-*.jar and lib/protobuf-java-*.jar to $ACCUMULO_HOME/lib/ext
3. Run bin/ingest.sh with one argument: the name of the directory in HDFS where the Wikipedia XML
   files reside. This starts a MapReduce job that ingests the data into Accumulo (see the sketch after this list).
   (For parallel ingest, instead run bin/ingest_parallel.sh)
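A sketch of the three ingest steps, assuming the tarball has been copied to the cluster and /wikipedia is the HDFS directory loaded earlier (the name of the extracted directory is assumed):

$ tar xzf wikisearch-ingest-*.tar.gz
$ cd wikisearch-ingest-*
$ cp lib/wikisearch-ingest-*.jar lib/protobuf-java-*.jar $ACCUMULO_HOME/lib/ext
$ bin/ingest.sh /wikipedia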
Query
-----
1. Copy the following jars from the query/target/dependency directory to $ACCUMULO_HOME/lib/ext (see the sketch after this list):
commons-jexl-*.jar
guava-*.jar
kryo-*.jar
minlog-*.jar
2. Copy query/target/wikisearch-query-*.jar to $ACCUMULO_HOME/lib/ext
3. Use the Accumulo shell to grant the user scan authorizations for the wikis that you loaded, for example:
setauths -u <user> -s all,enwiki,eswiki,frwiki,fawiki
4. cd into webapp and run mvn jetty:run
5. Open a browser and go to http://localhost:8080/accumulo-wikisearch/
   You can issue queries through this user interface or via the REST URL: <host>/accumulo-wikisearch/rest/query
6. Press Ctrl-C to stop the Jetty container.
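A sketch of steps 1, 2, and 4 above, run from the wikisearch directory:

$ cp query/target/dependency/commons-jexl-*.jar query/target/dependency/guava-*.jar $ACCUMULO_HOME/lib/ext
$ cp query/target/dependency/kryo-*.jar query/target/dependency/minlog-*.jar $ACCUMULO_HOME/lib/ext
$ cp query/target/wikisearch-query-*.jar $ACCUMULO_HOME/lib/ext
$ cd webapp
$ mvn jetty:run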