We have seen in the past that when a number of datasets and documents are uploaded periodically to mydig, some of the documents that are supposed to be processed never reach etk.
This could potentially be due to two issues:
1. Dead etk processes: the etk processes on the machine were killed by the OS or timed out, so the uploaded documents sit in Kafka in limbo indefinitely.
2. Kafka issues: the documents either timed out before they could be processed or Kafka has `lost` them.
Solution:
We will have a housekeeping module that runs periodically (frequency: TBD). This module will scan the `<project_name>_catalog` table for the documents (m) processed in the last 24 hours, and also scan the `etk_status` table for the documents (n) of the given project processed in the last 24 hours; see the sketch after these steps.
Ideally m == n.
If n < m:
find those m - n documents and add them back to the Kafka topic
update the value in the field `added_by` for these re-added documents
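A minimal sketch of what this housekeeping pass could look like, assuming the `<project_name>_catalog` and `etk_status` tables are reachable over a standard DB-API connection (sqlite3 is used here purely for illustration) and that re-queuing goes through kafka-python's `KafkaProducer`. The column names (`doc_id`, `timestamp`, `project`), the `housekeeping` value written into `added_by`, the broker address, and the Kafka payload shape are all assumptions for illustration, not part of the existing mydig/etk code.

```python
# Hypothetical housekeeping pass: compare catalog (m) against etk_status (n)
# for the last 24 hours and re-queue the missing m - n documents.
import json
import sqlite3
import time

from kafka import KafkaProducer

KAFKA_BROKERS = 'localhost:9092'      # assumption: broker address
LOOKBACK_SECONDS = 24 * 60 * 60       # "processed in the last 24 hours"


def reconcile_project(conn, producer, project_name, topic):
    """Find docs present in the catalog but missing from etk_status and re-queue them."""
    since = time.time() - LOOKBACK_SECONDS

    # Documents (m) recorded in <project_name>_catalog in the last 24 hours.
    catalog_ids = {row[0] for row in conn.execute(
        'SELECT doc_id FROM "{}_catalog" WHERE timestamp >= ?'.format(project_name),
        (since,))}

    # Documents (n) that etk reported on for this project in the same window.
    etk_ids = {row[0] for row in conn.execute(
        'SELECT doc_id FROM etk_status WHERE project = ? AND timestamp >= ?',
        (project_name, since))}

    # Ideally m == n; anything left over never reached etk and must be re-added.
    missing_ids = catalog_ids - etk_ids
    for doc_id in missing_ids:
        # Assumption: a stub payload; the real module would re-publish whatever
        # etk expects on this topic, with added_by marking the re-queue.
        producer.send(topic, {
            'doc_id': doc_id,
            'project': project_name,
            'added_by': 'housekeeping',
        })
    producer.flush()
    return len(catalog_ids), len(etk_ids), len(missing_ids)


if __name__ == '__main__':
    connection = sqlite3.connect('mydig_catalog.db')   # assumed DB location
    producer = KafkaProducer(
        bootstrap_servers=KAFKA_BROKERS,
        value_serializer=lambda v: json.dumps(v).encode('utf-8'))
    m, n, requeued = reconcile_project(connection, producer, 'my_project', 'my_project_in')
    print('catalog={} etk_status={} re-added={}'.format(m, n, requeued))
```

In the real module the payload written back to Kafka would have to match whatever etk consumes from that topic (i.e. the full document, not just its id), so the stub above only marks where the re-publish and the `added_by` update would happen.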