usc-isi-i2 / dig-etl-engine

Download DIG to run on your laptop or server.
http://usc-isi-i2.github.io/dig/
MIT License

Housekeeping: keeping track of lost documents in kafka #257

Open saggu opened 6 years ago

saggu commented 6 years ago

We have seen in the past that when there are a bunch of datasets and documents are uploaded periodically to mydig, some of the documents that are supposed to be processed never reach etk.

This could potentially be due to two issues:

1. dead etk processes: etk processes on the machine were killed by the OS or timed out, so the docs already uploaded to kafka hang there in limbo indefinitely.
2. kafka issues: the documents either timed out before they could be processed, or kafka has `lost` them.

Solution:

We will have a housekeeping module that runs periodically (frequency: TBD). This module will scan the <project_name>_catalog table for documents (m) processed in the last 24 hours, and also scan the etk_status table for documents (n) in the given project processed in the last 24 hours.

Ideally m == n.
If n < m:
    find those m - n missing documents and add them back to the kafka topic

Update the value of the `added_by` field for the re-queued documents.
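A minimal sketch of the core housekeeping logic, with hypothetical doc ids standing in for the two table scans (the real module would query the `<project_name>_catalog` and `etk_status` tables, and the actual table schemas and kafka topic names are assumptions here):

```python
def find_missing_docs(catalog_docs, etk_docs):
    """Return doc ids present in the project catalog (m docs) but
    absent from etk_status (n docs), i.e. the m - n documents that
    kafka/etk never processed and that must be re-queued."""
    return set(catalog_docs) - set(etk_docs)

# Hypothetical results of scanning the last 24 hours of each table
catalog_docs = ["doc1", "doc2", "doc3", "doc4"]  # m = 4, from <project_name>_catalog
etk_docs = ["doc1", "doc3"]                      # n = 2, from etk_status

missing = find_missing_docs(catalog_docs, etk_docs)
# Each missing doc would then be re-published to the project's kafka
# input topic, with its `added_by` field updated to record that the
# housekeeping module re-queued it.
```

The set difference keeps the check cheap even for large projects, and re-queueing only the difference means already-processed documents are never duplicated in kafka.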