The user sets the number of documents to process, n, for each of the k datasets.
For each of the k datasets:
- Scan the HBase table <project_name>_catalog for n documents.
- Update date_processed and status for those n documents in the table <project_name>_catalog.
- Add those n documents to the _in topic so that etk can process them.
- Depending on the status reported by etk and/or sandpaper, update or insert a row in the table etk_status for each of the n documents.
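
The loop above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual implementation: plain dicts stand in for the HBase tables, a list stands in for the _in topic, and get_status is a hypothetical callback standing in for whatever etk and/or sandpaper report. The function name process_datasets, the "queued" status value, and the rule that only rows without a status are picked up are all assumptions made for the example.

```python
from datetime import datetime, timezone


def process_datasets(catalogs, in_topic, etk_status, get_status, n):
    """Run one pass over every dataset's catalog (in-memory stand-ins only)."""
    for project_name, catalog in catalogs.items():
        # Scan the catalog for up to n documents.
        # Assumption: a row with no status has not been picked up yet.
        batch = [doc_id for doc_id, row in catalog.items()
                 if row.get("status") is None][:n]
        now = datetime.now(timezone.utc).isoformat()
        for doc_id in batch:
            # Update date_processed and status in the catalog.
            catalog[doc_id]["date_processed"] = now
            catalog[doc_id]["status"] = "queued"  # hypothetical status value
            # Add the document to the _in topic for etk to process.
            in_topic.append((project_name, doc_id))
        for doc_id in batch:
            # Update or insert the row in etk_status with the reported status.
            etk_status[(project_name, doc_id)] = get_status(project_name, doc_id)
    return etk_status
```

In a real deployment the dict and list operations would be replaced by calls to an HBase client and a message-queue producer; those calls are deliberately left out here because the source does not name the client libraries.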