Closed hamishgibbs closed 3 years ago
Filtering already-processed records should be accomplished with row hashes that persist between cleaning runs.
Output row-wise hashes in each ingestion, together with the date of processing, to the `config` folder.
To filter recognised records, check each incoming record against the stored hashes whose processing date != current date.
This will remove the dependence on `prop_ids` and will force reprocessing when the dataset changes, both improvements on the current routine.
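
A minimal Python sketch of the proposal above, for illustration only. The function names, the JSON log format, and the idea of a single hash-to-date mapping are assumptions, not the project's actual code; the log file location inside the `config` folder is left to the caller.

```python
import hashlib
import json
from pathlib import Path


def row_hash(row: dict) -> str:
    """Hash a record's full contents; key order does not affect the result."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_log(path: Path) -> dict:
    """Load the hash -> processing-date mapping (hypothetical JSON format)."""
    return json.loads(path.read_text()) if path.exists() else {}


def save_log(path: Path, hash_log: dict) -> None:
    """Persist the mapping so it survives between cleaning runs."""
    path.write_text(json.dumps(hash_log, indent=2))


def filter_unprocessed(rows: list, hash_log: dict, today: str) -> list:
    """Keep rows whose hash was not logged on a date other than `today`."""
    seen = {h for h, processed_on in hash_log.items() if processed_on != today}
    return [r for r in rows if row_hash(r) not in seen]


def record_hashes(rows: list, hash_log: dict, today: str) -> dict:
    """Log each row's hash with the processing date; the first date wins."""
    for r in rows:
        hash_log.setdefault(row_hash(r), today)
    return hash_log
```

Because the hash covers the whole row, any change to a record produces a new hash and the record is reprocessed, without relying on `prop_ids`. Hashes logged on the current date are ignored when filtering, so re-running on the same day reprocesses that day's records.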