Every data analysis project starts with exploring and cleaning data. For tabular data, there are some tools available that facilitate data pre-processing, such as OpenRefine and Trifacta Wrangler. However, these tools have one big disadvantage: they don't scale to really big data sets. Also, Trifacta Wrangler is not open source. For full text data, no comparable tools for data cleanup exist.
The goal of Rig is to provide a data cleaning tool that is open source, supports both tabular data and full text pre-processing, and scales to big data sets. The frontend is a web interface that supports loading the data and creating workflows. Based on user input, it generates scripts that are sent to the backend for execution. The backend relies on Spark to ensure scalability.
Currently, we are working on the use case of vocabulary cleanup for full text data. When working with full text data, pre-processing often consists of tokenizing texts (i.e., splitting them into words), lemmatizing or stemming the tokens, and performing some kind of filtering to remove typos and other unwanted tokens. Rig should support customizing these tasks, and provide visualizations that facilitate vocabulary cleanup.
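To make the pre-processing steps above concrete, here is a minimal sketch in plain Python. It is not Rig's implementation: the function names, the suffix list and the frequency threshold are all hypothetical, and the toy stemmer stands in for a real stemmer or lemmatizer.

```python
import re
from collections import Counter

# Hypothetical suffix list for a toy stemmer; a real project would use an
# established stemmer or lemmatizer instead.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean_vocabulary(texts, min_count=2):
    """Count stemmed tokens and keep those seen at least min_count times."""
    counts = Counter(stem(tok) for text in texts for tok in tokenize(text))
    return {tok: n for tok, n in counts.items() if n >= min_count}


# "hmoe" is a deliberate typo that the frequency filter should drop.
texts = ["The cats walked home.", "A cat walks hmoe."]
vocabulary = clean_vocabulary(texts)
```

A frequency threshold is only one possible filter; it removes rare typos such as "hmoe" but also drops legitimate rare words, which is why customizing and visualizing these steps matters.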
Install nodejs. See instructions at http://nodejs.org/. Make sure to install nodejs, nodejs-legacy and npm.
Install bower and grunt-cli globally:
sudo npm install -g bower grunt-cli
Fetch git repository
git clone https://github.com/nlesc-sherlock/Rig.git
Setup with bower
cd Rig
npm install
bower install
If you have already installed the bower packages, but need to update them for a new version of the code, run
bower update
Install requirements
pip install -r backend/requirements.txt
The webserver loads the data (this takes a few seconds), so it needs to know where the data is. Copy .rigrcEXAMPLE to .rigrc and set
path=/path/to/data/dir
to the directory that contains the data.
Start development server & open browser
grunt serve
Changes made to the code will automatically reload the web page.
Run the webserver:
python backend/server.py
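The webserver needs the data directory from .rigrc before it can load anything. How backend/server.py actually reads that file is not shown here; the sketch below only assumes that the file contains a single path=... line, and the function name read_data_path is hypothetical.

```python
import os
import tempfile


def read_data_path(rc_file):
    """Return the value of the 'path' key from a key=value file."""
    with open(rc_file) as handle:
        for line in handle:
            key, sep, value = line.strip().partition("=")
            if sep and key.strip() == "path":
                return value.strip()
    raise KeyError("no 'path' entry in " + rc_file)


# Demonstrate with a temporary file instead of a real .rigrc.
with tempfile.TemporaryDirectory() as tmp:
    rc_file = os.path.join(tmp, ".rigrc")
    with open(rc_file, "w") as handle:
        handle.write("path=/path/to/data/dir\n")
    data_dir = read_data_path(rc_file)
```

Raising an error when the entry is missing makes a misconfigured .rigrc fail early, rather than after the server has started.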
grunt test
Generates a test report and coverage inside the test/reports folder.
Tests in Chrome can be run with
grunt e2e-local
To connect to Sauce Labs, use the Sauce Connect program. See the Sauce Labs documentation for details on how to install and run it.
Before the tests can be run, the Sauce Labs credentials must be set up:
export SAUCE_USERNAME=<your sauce labs username>
export SAUCE_ACCESS_KEY=<your sauce labs access key>
Tests in Chrome, Firefox on Windows, Linux and OSX can be run with
grunt e2e-sauce
Travis CI also runs the end-to-end tests on Sauce Labs.
Note! Running grunt e2e-sauce will undo all changes in the app/ folder.