usc-isi-i2 / dig-etl-engine

Download DIG to run on your laptop or server.
http://usc-isi-i2.github.io/dig/
MIT License

myDIG Domain-Specific Search

myDIG is a tool to build pipelines that crawl the web, extract information, build a knowledge graph (KG) from the extractions, and provide an easy-to-use interface to query the KG. The project web page is DIG.

You can install myDIG on a laptop or server and use it to build a domain-specific search application for any corpus of web pages, CSV, JSON and a variety of other files.

Installation

Installation Requirements

Installation Instructions

myDIG uses Docker to make installation easy:

Install Docker and Docker Compose.

Configure Docker to use at least 6GB of memory. DIG will not work with less than 4GB and is unstable with less than 6GB.

On Mac and Windows, you can set the Docker memory in the Preferences menu of the Docker application. Details are in the Docker documentation pages (Mac Docker or Windows Docker). On Linux, containers share the host kernel directly, so a recent kernel version and enough free memory on the host are required.

Clone this repository.

git clone https://github.com/usc-isi-i2/dig-etl-engine.git

myDIG stores your project files on your disk, so you need to tell it where to put the files. You provide this information in the .env file in the folder where you installed myDIG. Create the .env file by copying the example environment file available in your installation.

cp ./dig-etl-engine/.env.example ./dig-etl-engine/.env

After you create your .env file, open it in a text editor and customize it. Here is a typical .env file:

COMPOSE_PROJECT_NAME=dig
DIG_PROJECTS_DIR_PATH=/Users/pszekely/Documents/mydig-projects
DOMAIN=localhost
PORT=12497
NUM_ETK_PROCESSES=2
KAFKA_NUM_PARTITIONS=2
DIG_AUTH_USER=admin
DIG_AUTH_PASSWORD=123

If you are working on Linux, do these additional steps:

chmod 666 logstash/sandbox/settings/logstash.yml
sysctl -w vm.max_map_count=262144

# replace <DIG_PROJECTS_DIR_PATH> with your own project path
mkdir -p <DIG_PROJECTS_DIR_PATH>/.es/data
chown -R 1000:1000 <DIG_PROJECTS_DIR_PATH>/.es

To set vm.max_map_count permanently, update it in /etc/sysctl.conf and reload the sysctl settings with sysctl -p /etc/sysctl.conf.
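For example, the persistent setting is a single line appended to the file (as root), with the same value used in the command above:

```
# /etc/sysctl.conf
vm.max_map_count=262144
```

After saving, run sudo sysctl -p /etc/sysctl.conf so the running kernel picks up the change without a reboot.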

Move the default Docker data directory (if Docker runs out of disk space) to a volume with more space

sudo mv /var/lib/docker /path_with_more_space
sudo ln -s /path_with_more_space /var/lib/docker

To run myDIG do:

./engine.sh up

On some operating systems, Docker commands require elevated privileges; prefix them with sudo. You can also run ./engine.sh up -d to run myDIG as a daemon process in the background. Wait a couple of minutes to ensure all the services are up.

To stop myDIG do:

./engine.sh stop

(Use ./engine.sh down to drop all containers.)

Once myDIG is running, go to your browser and visit http://localhost:12497/mydig/ui/

Note: myDIG currently works only on Chrome

To use myDIG, look at the user guide.

Upgrade Issues (12 June 2018)

myDIG v2 is now in alpha; there are a couple of big, incompatible changes.

Upgrade Issues (16 Nov 2017)

The ELK components (Elasticsearch, Logstash & Kibana) have been upgraded to 5.6.4, and other services in myDIG were updated as well. What you need to know:

You will lose all data and indices in the previous Elasticsearch and Kibana.

Upgrade Issues (20 Oct 2017)

On 20 Oct 2017, incompatible changes were made in the Landmark tool (1.1.0); the rules you defined will be deleted when you upgrade to the new system. Please follow these instructions:

There are also incompatible changes in the myDIG web service (1.0.11). Instead of crashing, it will show N/As in the TLD table; you need to update the desired number.

Access Endpoints:

Run with Add-ons

From command line

# run with ache
./engine.sh +ache up

# run with ache and rss crawler in background
./engine.sh +ache +rss up -d

# stop containers
./engine.sh stop

# drop containers
./engine.sh down

From env file

In .env file, add comma separated add-on names:

DIG_ADD_ONS=ache,rss

Then, simply do ./engine.sh up. You can also invoke additional add-ons at run time: ./engine.sh +dev up.
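As a side note, DIG_ADD_ONS is an ordinary comma-separated value; a generic shell sketch (not engine.sh's actual implementation) of how such a value splits into individual add-on names:

```shell
# Split a comma-separated add-on list into one name per line
# (illustrative only; engine.sh may do this differently)
DIG_ADD_ONS="ache,rss"
IFS=','
for addon in $DIG_ADD_ONS; do
  echo "$addon"
done
```

This prints each add-on name on its own line, which is the shape a launcher script would iterate over.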

Add-on list

Complete .env variable list

COMPOSE_PROJECT_NAME=dig
DIG_PROJECTS_DIR_PATH=./../mydig-projects
DOMAIN=localhost
PORT=12497
NUM_ETK_PROCESSES=2
KAFKA_NUM_PARTITIONS=2
DIG_AUTH_USER=admin
DIG_AUTH_PASSWORD=123
DIG_ADD_ONS=ache

KAFKA_HEAP_SIZE=512m
ZK_HEAP_SIZE=512m
LS_HEAP_SIZE=512m
ES_HEAP_SIZE=1g

DIG_NET_SUBNET=172.30.0.0/16
DIG_NET_KAFKA_IP=172.30.0.200

# only works in development mode
MYDIG_DIR_PATH=./../mydig-webservice
ETK_DIR_PATH=./../etk
SPACY_DIR_PATH=./../spacy-ui
RSS_DIR_PATH=./../dig-rss-feed-crawler

Advanced operations and solutions to known issues

Development Instructions

Manager's endpoints

Docker compose

Ports allocation in dig_net

dig_net is the internal network defined in Docker Compose.

Docker commands for development

build Nginx image:

docker build -t uscisii2/nginx:auth-1.0 nginx/.

build ETL image:

# git commit all changes first, then
./release_docker.sh tag
git push --tags
# update DIG_ETL_ENGINE_VERSION in file VERSION
./release_docker.sh build
./release_docker.sh push

Invoke development mode:

# clone a new etl to avoid conflict
git clone https://github.com/usc-isi-i2/dig-etl-engine.git dig-etl-engine-dev

# switch to the dev branch or another feature branch
git checkout dev

# create .env from .env.example
# change `COMPOSE_PROJECT_NAME` in .env from `dig` to `digdev`
# you also need a new project folder

# run docker in dev branch
./engine.sh up

# run docker in dev mode (optional)
./engine.sh +dev up

Kafka input parameters of interest for Logstash

auto_offset_reset

What to do when there is no initial offset in Kafka or an offset is out of range: earliest resets to the beginning of the topic, latest resets to the most recent offset, and none throws an error to the consumer.

bootstrap_servers

A list of URLs to use for establishing the initial connection to the cluster. This list should be in the form host1:port1,host2:port2. These URLs are only used for the initial connection to discover the full cluster membership (which may change dynamically), so the list need not contain the full set of servers (you may want more than one, though, in case a server is down).

consumer_threads

Ideally you should have as many threads as there are partitions for a perfect balance; more threads than partitions means that some threads will be idle.

group_id

The identifier of the group this consumer belongs to. A consumer group is a single logical subscriber that happens to be made up of multiple processes. Messages in a topic will be distributed across all Logstash instances with the same group_id.

topics

A list of topics to subscribe to, defaults to ["logstash"].
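Putting these parameters together, a minimal Logstash kafka input might look like the following sketch. The broker address, topic name, and group id are illustrative, and consumer_threads is set to match the two partitions from KAFKA_NUM_PARTITIONS=2 in the example .env:

```
input {
  kafka {
    bootstrap_servers => "kafka:9092"
    topics            => ["dig-output"]
    group_id          => "logstash"
    consumer_threads  => 2
    auto_offset_reset => "earliest"
  }
}
```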