usc-isi-i2 / dig-etl-engine

Download DIG to run on your laptop or server.
http://usc-isi-i2.github.io/dig/
MIT License

Processing documents #256

Open saggu opened 5 years ago

saggu commented 5 years ago

The user has set the desired number of documents to process, n, for each of k datasets.

For each of the k datasets:
   - Scan the HBase table <project_name>_catalog for n documents.
   - Update date_processed and status for those n documents in <project_name>_catalog.
   - Add the n documents to the _in topic so that etk processes them.
   - Depending on the status reported by etk and/or sandpaper, update or insert a row in the table etk_status for each of the n documents.
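
The loop above could be sketched roughly as follows. This is only an illustration with in-memory stand-ins: plain dicts replace the HBase tables and a list replaces the _in topic. The function name `process_datasets`, the row layout, the "queued" status value, the `report_status` callback, and the rule that only rows without a status are picked are all assumptions, not something the project defines.

```python
from datetime import datetime, timezone


def process_datasets(catalog, in_topic, etk_status, datasets, n, report_status):
    """Run one pass of the per-dataset loop (hypothetical sketch).

    catalog:       dict of row key -> row dict, stand-in for <project_name>_catalog
    in_topic:      list, stand-in for the _in topic
    etk_status:    dict of row key -> status, stand-in for the etk_status table
    report_status: callable returning the status reported for a document
    """
    for dataset in datasets:
        # Scan the catalog for up to n documents of this dataset.
        # Assumption: a document is eligible when it has no status yet.
        selected = [
            key for key, row in catalog.items()
            if row["dataset"] == dataset and row.get("status") is None
        ][:n]

        for key in selected:
            row = catalog[key]
            # Update date_processed and status in the catalog.
            row["date_processed"] = datetime.now(timezone.utc).isoformat()
            row["status"] = "queued"  # hypothetical status value
            # Hand the document over for processing.
            in_topic.append(row["doc"])
            # Update or insert the status row for this document.
            etk_status[key] = report_status(row["doc"])


# Example run: two datasets, at most n=2 documents taken from each.
catalog = {
    "a1": {"dataset": "a", "doc": {"id": "a1"}},
    "a2": {"dataset": "a", "doc": {"id": "a2"}},
    "a3": {"dataset": "a", "doc": {"id": "a3"}},
    "b1": {"dataset": "b", "doc": {"id": "b1"}},
}
in_topic, etk_status = [], {}
process_datasets(
    catalog, in_topic, etk_status,
    datasets=["a", "b"], n=2,
    report_status=lambda doc: "ok",  # stand-in for the etk/sandpaper report
)
```

In a real deployment the status would arrive asynchronously after etk finishes, so the etk_status update would likely live in a separate consumer rather than in the same loop.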