Apache Accumulo Wikipedia Search Example
========================================
This project contains a sample application for ingesting and querying Wikipedia data.
Though not strictly required, ingest will go more quickly if the Wikipedia dump files are decompressed, for example while loading them into HDFS:
$ bunzip2 < enwiki-*-pages-articles.xml.bz2 | hadoop fs -put - /wikipedia/enwiki-pages-articles.xml
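If the target HDFS directory does not already exist, create it before running the command above (the /wikipedia path is simply the directory used in this example):

$ hadoop fs -mkdir /wikipedia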
Configuration and Build
-----------------------
1. Copy ingest/conf/wikipedia.xml.example to ingest/conf/wikipedia.xml and edit it to specify your Accumulo connection information.
   (For parallel ingest, instead copy ingest/conf/wikipedia_parallel.xml.example to ingest/conf/wikipedia.xml)
2. Copy webapp/src/main/resources/app.properties.example to webapp/src/main/resources/app.properties and edit it in the same way.
3. From the wikisearch directory, run mvn package (a command sketch for these steps follows this list).
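A minimal sketch of these three steps, assuming you are working from the top-level wikisearch directory:

$ cp ingest/conf/wikipedia.xml.example ingest/conf/wikipedia.xml
$ cp webapp/src/main/resources/app.properties.example webapp/src/main/resources/app.properties
$ # edit both copied files to point at your Accumulo instance, then build
$ mvn package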
Ingest
------
1. Copy ingest/target/wikisearch-ingest-*.tar.gz to the cluster and untar it.
2. Copy lib/wikisearch-ingest-*.jar and lib/protobuf-java-*.jar to $ACCUMULO_HOME/lib/ext
3. Run bin/ingest.sh with one argument: the name of the directory in HDFS where the Wikipedia XML
   files reside. This starts a MapReduce job that ingests the data into Accumulo (see the sketch after this list).
   (For parallel ingest, instead run bin/ingest_parallel.sh)
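A sketch of the three ingest steps, assuming the tarball has been copied to the cluster and /wikipedia is the HDFS directory loaded earlier (the name of the extracted directory is assumed):

$ tar xzf wikisearch-ingest-*.tar.gz
$ cd wikisearch-ingest-*
$ cp lib/wikisearch-ingest-*.jar lib/protobuf-java-*.jar $ACCUMULO_HOME/lib/ext
$ bin/ingest.sh /wikipedia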
Query
-----
1. Copy the following jars from the query/target/dependency directory to $ACCUMULO_HOME/lib/ext (see the sketch after this list):
commons-jexl-*.jar
guava-*.jar
kryo-*.jar
minlog-*.jar
2. Copy query/target/wikisearch-query-*.jar to $ACCUMULO_HOME/lib/ext
3. Use the Accumulo shell to grant the user scan authorizations for the wikis that you loaded, for example:
setauths -u <user> -s all,enwiki,eswiki,frwiki,fawiki
4. cd into webapp and run mvn jetty:run
5. Open a browser and go to http://localhost:8080/accumulo-wikisearch/
   You can issue queries through this user interface or via the REST URL: <host>/accumulo-wikisearch/rest/query
6. Press Ctrl-C to stop the Jetty container.
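A sketch of steps 1, 2, and 4 above, run from the wikisearch directory:

$ cp query/target/dependency/commons-jexl-*.jar query/target/dependency/guava-*.jar $ACCUMULO_HOME/lib/ext
$ cp query/target/dependency/kryo-*.jar query/target/dependency/minlog-*.jar $ACCUMULO_HOME/lib/ext
$ cp query/target/wikisearch-query-*.jar $ACCUMULO_HOME/lib/ext
$ cd webapp
$ mvn jetty:run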