usc-isi-i2 / dig-etl-engine

Download DIG to run on your laptop or server.
http://usc-isi-i2.github.io/dig/
MIT License

Creating a new project #253

Closed (saggu closed 5 years ago)

saggu commented 5 years ago

When a new project is created in myDIG, the following should happen at the backend.

Create a new hbase table for the project: <project_name>_catalog.

Catalog table's schema

  rowid: the id of the row in hbase, designed as `<dataset>_doc_id`. This makes it quick to fetch all docs under that dataset.
  document: the cdr json document
  date_added: date when this document was added to the project
  date_processed: date when this document was scheduled to be processed by etk
  file_name: the user uploaded file this json belongs to
  dataset: dataset specified by user for this document
  status: 0 - NEW, 1 - SCHEDULED to be processed
  identifier: the id of the document
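
A minimal sketch of how a catalog row could be assembled from the schema above. It is not the project's actual code: the helper names (`catalog_row_key`, `dataset_prefix`, `new_catalog_row`), the ISO timestamp format, and the empty-string placeholder for `date_processed` are assumptions made for illustration. Only the field names, the status codes, and the `<dataset>_doc_id` key layout come from this issue.

```python
import json
from datetime import datetime, timezone

# Status codes taken from the catalog schema in this issue.
STATUS_NEW = 0
STATUS_SCHEDULED = 1


def catalog_row_key(dataset: str, doc_id: str) -> str:
    """Build the row id as <dataset>_<doc_id> (hypothetical helper)."""
    return f"{dataset}_{doc_id}"


def dataset_prefix(dataset: str) -> str:
    """Row-key prefix shared by every document stored under one dataset."""
    return f"{dataset}_"


def new_catalog_row(doc: dict, doc_id: str, dataset: str, file_name: str):
    """Return (row key, field dict) for a newly added document."""
    fields = {
        "document": json.dumps(doc),
        # Assumption: dates are stored as UTC ISO-8601 strings.
        "date_added": datetime.now(timezone.utc).isoformat(),
        # Assumption: left empty until the document is scheduled for etk.
        "date_processed": "",
        "file_name": file_name,
        "dataset": dataset,
        "status": STATUS_NEW,
        "identifier": doc_id,
    }
    return catalog_row_key(dataset, doc_id), fields


key, fields = new_catalog_row({"doc_id": "abc123"}, "abc123", "news", "upload.jl")
print(key)
```

With keys shaped this way, a prefix scan on `dataset_prefix("news")` would return the rows of that dataset. One caveat of the layout: a dataset named `news_archive` would also match the prefix `news_`, so dataset names may need to avoid sharing such prefixes.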

Create a new hbase table for the project for storing etk status for docs in this project: <project_name>_etk_status.

Schema:

rowid: row id for the row, design: <project_name>_<dataset>_doc_id. Allows us to quickly fetch all the documents for given project and under a specific dataset.
date_last_processed: date when the document was last processed by etk (successfully or not)
status: 0 - etk error
        1 - sandpaper error
        2 - successfully processed by etk and sent to the out kafka topic
added_by: dig_etl_engine or housekeeping

saggu commented 5 years ago

Added code to create the table when a project is created. Closing

saggu commented 5 years ago

Updated the hbase table schema: each project will have 2 tables, catalog and etk_status. Keeping etk_status in its own table makes it easier to remove the tables when deleting a project.
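
A rough sketch of the per-project two-table layout, under stated assumptions. The table names, the etk_status row-key design, and the status codes come from this issue. The helper names, the column family `cf`, and the `FakeConnection` stand-in are hypothetical. The stand-in mimics the `create_table(name, families)` and `delete_table(name, disable=...)` methods of a HappyBase-style connection so the example runs without an HBase server. Whether the project actually uses such a client is not stated here.

```python
# Status codes taken from the etk_status schema in this issue.
ETK_ERROR = 0
SANDPAPER_ERROR = 1
PROCESSED = 2


def project_tables(project_name):
    """Names of the two tables that belong to one project."""
    return [f"{project_name}_catalog", f"{project_name}_etk_status"]


def etk_status_row_key(project_name, dataset, doc_id):
    """Row id designed as <project_name>_<dataset>_<doc_id>."""
    return f"{project_name}_{dataset}_{doc_id}"


def create_project_tables(connection, project_name, families):
    """Create both tables when a project is created."""
    for name in project_tables(project_name):
        connection.create_table(name, families)


def delete_project_tables(connection, project_name):
    """Drop both tables when a project is deleted."""
    for name in project_tables(project_name):
        connection.delete_table(name, disable=True)


class FakeConnection:
    """In-memory stand-in for an HBase client (illustration only)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, families):
        self.tables[name] = families

    def delete_table(self, name, disable=False):
        del self.tables[name]


conn = FakeConnection()
# "cf" is a hypothetical column family name.
create_project_tables(conn, "demo", {"cf": dict()})
created = sorted(conn.tables)
print(created)
delete_project_tables(conn, "demo")
```

Because every table name is derived from the project name, deleting a project reduces to dropping the names returned by `project_tables`.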

saggu commented 5 years ago

Implemented