usc-isi-i2 / dig-etl-engine

Download DIG to run on your laptop or server.
http://usc-isi-i2.github.io/dig/
MIT License

Starting up myDIG for the very first time #254

Open saggu opened 5 years ago

saggu commented 5 years ago

Create the HBase table dataset_view if it does not exist.

This table will be used for the TLD view in the myDIG frontend.

Schema:

rowid: design- <project_name>_<dataset>. Allows us to quickly fetch the total number of documents in each project and for each dataset
total_docs: number of documents in <dataset> in <project_name>

Note: We are not going to track the desired number of docs. It will exist only as a frontend concept: based on the number the user has entered, that many documents will be fetched from HBase and processed.
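A minimal sketch of the "create if it does not exist" step, assuming a client object that exposes `tables()` and `create_table(name, families)` the way a happybase `Connection` does. The helper name `ensure_dataset_view` and the column family name `counts` are illustrative assumptions and do not come from this issue.

```python
def ensure_dataset_view(connection, table_name="dataset_view", family="counts"):
    """Create the table only when it is missing.

    `connection` is assumed to offer tables() and create_table(), as a
    happybase Connection does. The family name "counts" is hypothetical.
    Returns True if the table was created, False if it already existed.
    """
    # Table names may come back as bytes, so normalise them to str.
    existing = {
        name.decode() if isinstance(name, bytes) else name
        for name in connection.tables()
    }
    if table_name in existing:
        return False
    # An empty options dict leaves the column family at its defaults.
    connection.create_table(table_name, {family: {}})
    return True
```

Calling the helper a second time is a no-op, so it should be safe to run at every startup.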

saggu commented 5 years ago

Implemented

saggu commented 5 years ago

Adding desired to HBase as well. The updated schema looks like this:

Schema:

rowid: design- <project_name>_<dataset>. Allows us to quickly fetch the total number of documents in each project and for each dataset
total_docs: number of documents in <dataset> in <project_name>
desired: number of desired docs in elasticsearch for <dataset> in <project_name>

saggu commented 5 years ago

We also need to track the total number of docs added to Kafka during ETK processing. Updated schema:

Schema:

rowid: design- <project_name>_<dataset>. Allows us to quickly fetch the total number of documents in each project and for each dataset
total_docs: number of documents in <dataset> in <project_name>
desired: number of desired docs in elasticsearch for <dataset> in <project_name>
added_docs: total number of docs added to Kafka for <dataset> in <project_name>
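The final schema can be sketched as plain Python, with an in-memory dict standing in for the HBase table. The function names are hypothetical. The `prefix` parameter is an assumption that covers whatever leading text the rowid carries, and it defaults to empty.

```python
def row_key(project_name, dataset, prefix=""):
    """Build the rowid as <project_name>_<dataset>, with an optional prefix."""
    return f"{prefix}{project_name}_{dataset}"


def new_row(total_docs=0, desired=0, added_docs=0):
    """One dataset_view row holding the three counters from the schema."""
    return {"total_docs": total_docs, "desired": desired, "added_docs": added_docs}


def record_added(table, project_name, dataset, count):
    """Add `count` to added_docs for one dataset and return the new value.

    `table` is a plain dict here; a real HBase client would be used instead.
    """
    key = row_key(project_name, dataset)
    row = table.setdefault(key, new_row())
    row["added_docs"] += count
    return row["added_docs"]


# Usage: one project with one dataset.
table = {row_key("proj", "news"): new_row(total_docs=100, desired=40)}
record_added(table, "proj", "news", 25)
```

Keeping one row per project and dataset means a single lookup by rowid returns all three counters.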