myDIG is a tool to build pipelines that crawl the web, extract information, build a knowledge graph (KG) from the extractions, and provide an easy-to-use interface to query the KG. The project web page is DIG.
You can install myDIG on a laptop or server and use it to build a domain-specific search application for any corpus of web pages, CSV, JSON and a variety of other files.
myDIG uses Docker to make installation easy:
Install Docker and Docker Compose.
Configure Docker to use at least 6GB of memory. DIG will not work with less than 4GB and is unstable with less than 6GB.
On Mac and Windows, you can set the Docker memory in the Preferences menu of the Docker application. Details are in the Docker documentation pages (Mac Docker or Windows Docker). On Linux, Docker runs directly on the host kernel, so a recent kernel version and enough memory on the host are required.
Clone this repository.
git clone https://github.com/usc-isi-i2/dig-etl-engine.git
myDIG stores your project files on your disk, so you need to tell it where to put the files. You provide this information in the .env file in the folder where you installed myDIG. Create the .env file by copying the example environment file available in your installation.
cp ./dig-etl-engine/.env.example ./dig-etl-engine/.env
After you create your .env file, open it in a text editor and customize it. Here is a typical .env file:
COMPOSE_PROJECT_NAME=dig
DIG_PROJECTS_DIR_PATH=/Users/pszekely/Documents/mydig-projects
DOMAIN=localhost
PORT=12497
NUM_ETK_PROCESSES=2
KAFKA_NUM_PARTITIONS=2
DIG_AUTH_USER=admin
DIG_AUTH_PASSWORD=123
COMPOSE_PROJECT_NAME: leave this one alone if you only have one myDIG instance. This is the prefix to differentiate docker-compose instances.
DIG_PROJECTS_DIR_PATH: this is the folder where myDIG will store your project files. Make sure the directory exists. The default setting will store your files in ./mydig-projects, so do mkdir ./mydig-projects if you want to use the default folder.
DOMAIN: change this if you install on a server that will be accessed from other machines.
PORT: you can customize the port where myDIG runs.
NUM_ETK_PROCESSES: myDIG uses multi-processing to ingest files. Set this number according to the number of cores you have on the machine. We don't recommend setting it to more than 4 on a laptop.
KAFKA_NUM_PARTITIONS: the number of partitions per topic. Set it to the same value as NUM_ETK_PROCESSES. It will not affect the existing partition number in Kafka topics unless you drop the Kafka container (you will lose all data in Kafka topics).
DIG_AUTH_USER, DIG_AUTH_PASSWORD: myDIG uses nginx to control access.
If you are working on Linux, do these additional steps:
chmod 666 logstash/sandbox/settings/logstash.yml
sysctl -w vm.max_map_count=262144
# replace <DIG_PROJECTS_DIR_PATH> with your own project path
mkdir -p <DIG_PROJECTS_DIR_PATH>/.es/data
chown -R 1000:1000 <DIG_PROJECTS_DIR_PATH>/.es
To set vm.max_map_count permanently, update it in /etc/sysctl.conf and reload the sysctl settings with sysctl -p /etc/sysctl.conf.
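The permanent setting can be scripted. The following is a minimal sketch, not part of myDIG: persist_sysctl is a hypothetical helper that writes a key=value line to a sysctl configuration file, replacing an existing entry for the same key instead of appending a duplicate. The third argument (the file to edit) defaults to /etc/sysctl.conf, which normally requires root to modify.

```shell
# persist_sysctl KEY VALUE [CONF_FILE]
# Hypothetical helper: make sure CONF_FILE contains exactly "KEY=VALUE".
# The body runs in a subshell, so its variables do not leak to the caller.
persist_sysctl() (
  key="$1"
  value="$2"
  conf="${3:-/etc/sysctl.conf}"
  # Note: dots in KEY are treated as regex wildcards, which is acceptable here.
  if grep -q "^[[:space:]]*${key}[[:space:]]*=" "$conf" 2>/dev/null; then
    # The key is already present: rewrite that line in place.
    tmp=$(mktemp) || exit 1
    sed "s|^[[:space:]]*${key}[[:space:]]*=.*|${key}=${value}|" "$conf" > "$tmp" \
      && cat "$tmp" > "$conf"
    rm -f "$tmp"
  else
    # The key is absent: append it.
    printf '%s=%s\n' "$key" "$value" >> "$conf"
  fi
)

# Usage (as root), followed by the reload command from the text above:
#   persist_sysctl vm.max_map_count 262144
#   sysctl -p /etc/sysctl.conf
```

Calling the helper repeatedly leaves a single entry in the file, so it is safe to include in a setup script that may run more than once.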
If Docker runs out of disk space, move the default Docker installation to a volume with more space:
sudo mv /var/lib/docker /path_with_more_space
sudo ln -s /path_with_more_space /var/lib/docker
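Before starting myDIG, it can help to sanity-check the .env file against the settings described above. This is a minimal sketch under stated assumptions: env_value and check_env are hypothetical helpers, not part of myDIG, and they only verify two things from this guide, namely that DIG_PROJECTS_DIR_PATH exists and that KAFKA_NUM_PARTITIONS equals NUM_ETK_PROCESSES. Relative paths are resolved against the current directory, so run it from the folder that contains .env.

```shell
# env_value KEY FILE
# Print the value of KEY from a KEY=VALUE file (the last occurrence wins).
env_value() {
  sed -n "s/^$1=//p" "$2" | tail -n 1
}

# check_env FILE
# Hypothetical pre-flight check; returns non-zero if a problem is found.
check_env() (
  env_file="$1"
  projects_dir=$(env_value DIG_PROJECTS_DIR_PATH "$env_file")
  procs=$(env_value NUM_ETK_PROCESSES "$env_file")
  parts=$(env_value KAFKA_NUM_PARTITIONS "$env_file")
  status=0
  if [ ! -d "$projects_dir" ]; then
    echo "DIG_PROJECTS_DIR_PATH is not an existing directory: $projects_dir" >&2
    status=1
  fi
  if [ "$procs" != "$parts" ]; then
    echo "KAFKA_NUM_PARTITIONS ($parts) should equal NUM_ETK_PROCESSES ($procs)" >&2
    status=1
  fi
  exit "$status"
)

# Usage:
#   check_env ./dig-etl-engine/.env && echo ".env looks consistent"
```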
To run myDIG do:
./engine.sh up
Docker commands require elevated privileges on some operating systems; if so, add sudo before them. You can also run ./engine.sh up -d to run myDIG as a daemon process in the background. Wait a couple of minutes to ensure all the services are up.
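Instead of waiting a fixed amount of time, a script can poll the UI until it answers. The sketch below is illustrative only: probe and wait_for_mydig are hypothetical helpers, and it assumes that nginx protects the UI with HTTP basic authentication using DIG_AUTH_USER and DIG_AUTH_PASSWORD (falling back to the sample values from the .env file above). A successful response from the UI does not prove that every backend service has finished starting.

```shell
# probe URL
# Succeed if URL answers with a successful HTTP status.
# Assumption: the UI is behind HTTP basic auth with the .env credentials.
probe() {
  curl --silent --fail --output /dev/null --max-time 5 \
    --user "${DIG_AUTH_USER:-admin}:${DIG_AUTH_PASSWORD:-123}" "$1"
}

# wait_for_mydig URL [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Poll URL until probe succeeds, or give up once TIMEOUT_SECONDS have passed.
wait_for_mydig() (
  url="$1"
  timeout="${2:-300}"
  interval="${3:-5}"
  waited=0
  until probe "$url"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "no response from $url after ${waited}s" >&2
      exit 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "myDIG is responding at $url"
)

# Usage, after ./engine.sh up -d:
#   wait_for_mydig http://localhost:12497/mydig/ui/ 300 5
```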
To stop myDIG do:
./engine.sh stop
(Use ./engine.sh down to drop all containers.)
Once myDIG is running, go to your browser and visit http://localhost:12497/mydig/ui/
Note: myDIG currently works only on Chrome
To use myDIG, look at the user guide
myDIG v2 is now in alpha, and there are a couple of big, incompatible changes.
DIG_PROJECTS_DIR_PATH/.*
After ./engine.sh down, you need to delete DIG_PROJECTS_DIR_PATH/.kafka and then restart the engine (if you change NUM_ETK_PROCESSES).
custom_etk_config.json and additional_etk_config/* are no longer used; instead, generated ETK modules are in working_dir/generated_em and additional modules are in working_dir/additional_ems.
working_dir/etk_worker_*.log.
The ELK (Elasticsearch, Logstash & Kibana) components have been upgraded to 5.6.4, and other services in myDIG were also updated. What you need to do is:
Run docker-compose down.
Remove the directory DIG_PROJECTS_DIR_PATH/.es.
You will lose all data and indices in the previous Elasticsearch and Kibana.
On 20 Oct 2017, incompatible changes were introduced in the Landmark tool (1.1.0); the rules you defined will be deleted when you upgrade to the new system. Please follow these instructions:
DIG_PROJECTS_DIR_PATH/.landmark
DIG_PROJECTS_DIR_PATH/<project_name>/landmark_rules/*
There are also incompatible changes in the myDIG webservice (1.0.11). Instead of crashing, it will show N/As in the TLD table; you need to update the desired number.
http://localhost:12497/mydig/ui/
http://localhost:12497/es/
http://localhost:12497/kibana/
http://localhost:12497/kafka_manager/
# run with ache
./engine.sh +ache up
# run with ache and rss crawler in background
./engine.sh +ache +rss up -d
# stop containers
./engine.sh stop
# drop containers
./engine.sh down
In the .env file, add comma-separated add-on names:
DIG_ADD_ONS=ache,rss
Then, simply do ./engine.sh up. You can also invoke additional add-ons at run time: ./engine.sh +dev up.
ache: ACHE Crawler (coming soon).
rss: RSS Feed Crawler (coming soon).
kafka-manager: Kafka Manager.
dev: Development mode.
COMPOSE_PROJECT_NAME=dig
DIG_PROJECTS_DIR_PATH=./../mydig-projects
DOMAIN=localhost
PORT=12497
NUM_ETK_PROCESSES=2
KAFKA_NUM_PARTITIONS=2
DIG_AUTH_USER=admin
DIG_AUTH_PASSWORD=123
DIG_ADD_ONS=ache
KAFKA_HEAP_SIZE=512m
ZK_HEAP_SIZE=512m
LS_HEAP_SIZE=512m
ES_HEAP_SIZE=1g
DIG_NET_SUBNET=172.30.0.0/16
DIG_NET_KAFKA_IP=172.30.0.200
# only works in development mode
MYDIG_DIR_PATH=./../mydig-webservice
ETK_DIR_PATH=./../etk
SPACY_DIR_PATH=./../spacy-ui
RSS_DIR_PATH=./../dig-rss-feed-crawler
If some of the Docker images tagged latest in the docker-compose file are updated, run docker-compose pull <service name> first.
The data in kafka queue will be cleaned after two days. If you want to delete the data immediately, drop the kafka container.
If you want to run your own ETK config, name the file custom_etk_config.json and put it in DIG_PROJECTS_DIR_PATH/<project_name>/working_dir/.
If you have additional ETK config files, put them in DIG_PROJECTS_DIR_PATH/<project_name>/working_dir/additional_etk_config/ (create the additional_etk_config directory if it's not there).
If you are using a custom ETK config or additional ETK configs, you need to take care of all file paths in these config files. DIG_PROJECTS_DIR_PATH/<project_name> will be mapped to /shared_data/projects/<project_name> in Docker, so make sure all the paths you use in the configs start with this prefix.
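The path mapping described above can be sketched as a small helper. to_container_path is a hypothetical function, not part of myDIG; it takes the value of DIG_PROJECTS_DIR_PATH and a file path on the host, and prints the path to use inside an ETK config file. It works on the strings only and does not check that the file exists.

```shell
# to_container_path PROJECTS_DIR HOST_PATH
# Hypothetical helper: rewrite a host path under PROJECTS_DIR to the
# /shared_data/projects/... path seen inside the Docker containers.
to_container_path() (
  projects_dir="${1%/}"   # drop a trailing slash, if any
  host_path="$2"
  case "$host_path" in
    "$projects_dir"/*)
      rel=${host_path#"$projects_dir"/}
      printf '%s\n' "/shared_data/projects/$rel"
      ;;
    *)
      echo "not under $projects_dir: $host_path" >&2
      exit 1
      ;;
  esac
)

# Usage:
#   to_container_path /Users/pszekely/Documents/mydig-projects \
#     /Users/pszekely/Documents/mydig-projects/new_project/working_dir/custom_etk_config.json
```

Paths outside the projects folder are rejected with a non-zero exit status, since nothing in this guide indicates that they are visible inside the containers.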
If you want to clean up all Elasticsearch data, remove the .es directory in your DIG_PROJECTS_DIR_PATH.
If you want to clean up all of the Landmark tool's database data, remove the .landmark directory in your DIG_PROJECTS_DIR_PATH. Note that this will make published rules untraceable.
On Linux, if you cannot access the Docker network from the host machine: 1. stop the Docker containers; 2. run docker network ls to find the id of dig_net, find this id in the output of ifconfig, run ifconfig <interface id> down to delete this network interface, and restart the Docker service.
On Linux, if DNS does not work correctly in dig_net, please refer to this post.
On Linux, solutions for potential Elasticsearch problems can be found here.
If there's a Docker network conflict, use docker network rm <network id> to remove the conflicting network.
POST /create_project
{
"project_name" : "new_project"
}
POST /run_etk
{
"project_name" : "new_project",
"number_of_workers": 4,
"input_offset": "seek_to_end", // optional
"output_offset": "seek_to_end" // optional
}
POST /kill_etk
{
"project_name" : "new_project",
"input_offset": "seek_to_end", // optional
"output_offset": "seek_to_end" // optional
}
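The requests above could be sent from the command line roughly as follows. This is a sketch under stated assumptions: this document does not say at which host and port these endpoints are served, nor which content type the service expects, so the base URL argument and the Content-Type header are assumptions, and the helper names are hypothetical. The payload builders do not escape special characters, so they are only suitable for simple project names.

```shell
# project_payload PROJECT_NAME
# Body for /create_project (and for /kill_etk without the optional offsets).
project_payload() {
  printf '{"project_name": "%s"}\n' "$1"
}

# run_etk_payload PROJECT_NAME NUMBER_OF_WORKERS
# Body for /run_etk without the optional offsets.
run_etk_payload() {
  printf '{"project_name": "%s", "number_of_workers": %d}\n' "$1" "$2"
}

# post_json BASE_URL ENDPOINT JSON_BODY
# BASE_URL is an assumption; replace it with the address of your deployment.
post_json() {
  curl --silent --show-error --fail --request POST \
    --header 'Content-Type: application/json' \
    --data "$3" "$1$2"
}

# Usage (<base url> is left for you to fill in):
#   post_json "<base url>" /create_project "$(project_payload new_project)"
#   post_json "<base url>" /run_etk "$(run_etk_payload new_project 4)"
```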
Create the .env file from .env.example and change the environment variables.
Run ./engine.sh up for the sandbox version.
Run docker-compose -f docker-compose-production.yml up for the production version.
dig_net is the LAN in Docker Compose.
build Nginx image:
docker build -t uscisii2/nginx:auth-1.0 nginx/.
build ETL image:
# git commit all changes first, then
./release_docker.sh tag
git push --tags
# update DIG_ETL_ENGINE_VERSION in file VERSION
./release_docker.sh build
./release_docker.sh push
Invoke development mode:
# clone a new etl to avoid conflict
git clone https://github.com/usc-isi-i2/dig-etl-engine.git dig-etl-engine-dev
# enter the new clone, then switch to the dev branch or another feature branch
cd dig-etl-engine-dev
git checkout dev
# create .env from .env.example
# change `COMPOSE_PROJECT_NAME` in .env from `dig` to `digdev`
# you also need a new project folder
# run docker in dev branch
./engine.sh up
# run docker in dev mode (optional)
./engine.sh +dev up
auto_offset_reset
What to do when there is no initial offset in Kafka or if an offset is out of range.
bootstrap_servers
A list of URLs to use for establishing the initial connection to the cluster. This list should be in the form of host1:port1,host2:port2. These URLs are just used for the initial connection to discover the full cluster membership (which may change dynamically), so this list need not contain the full set of servers (you may want more than one, though, in case a server is down).
consumer_threads
Ideally you should have as many threads as the number of partitions for a perfect balance; more threads than partitions means that some threads will be idle.
group_id
The identifier of the group this consumer belongs to. A consumer group is a single logical subscriber that happens to be made up of multiple processors. Messages in a topic will be distributed to all Logstash instances with the same group_id.
topics
A list of topics to subscribe to, defaults to ["logstash"].