lightblue-platform / lightblue-migrator


Process large result sets in batches to minimize the risk of data running stale #394

Open paterczm opened 8 years ago

paterczm commented 8 years ago

The consistency checker processes data in time frames. The number of rows/docs it fetches for a given time frame can be very large, and processing can take significant time. During that time, data in the source may change, which can result in the migrator overwriting good data with a previous state. To minimize this risk, the migrator should process large data sets in smaller batches. The downside is more load, but with a configurable batch size that should not be a problem.

dcrissman commented 8 years ago

I am definitely in support of trying to clean up or minimize the stale data issue. I am trying to think through how this might work in practice, though, considering that the data in both data sources is volatile. I am not even sure we could safely assume the number of modified source rows would remain the same for a pagination approach to work, as many queries use a lastModifiedDate field, which changes whenever the row is updated again.

Perhaps we could run an initial query to collect all the identity field values, split by the batch size, then query for each set of identities and update accordingly?
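For illustration, a minimal sketch of the "split by the batch size" step, assuming the identity values have already been collected by an initial lightweight query. The IdentityBatcher class and its partition helper are hypothetical, not part of the migrator's API.

```java
import java.util.ArrayList;
import java.util.List;

public class IdentityBatcher {

    /**
     * Splits the identity values returned by the initial query into
     * fixed-size batches that can each be fetched and migrated on their own.
     */
    public static <T> List<List<T>> partition(List<T> identities, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < identities.size(); i += batchSize) {
            int end = Math.min(i + batchSize, identities.size());
            // Copy the sub-list so each batch stands alone and does not
            // keep a view over the full identity list alive.
            batches.add(new ArrayList<>(identities.subList(i, end)));
        }
        return batches;
    }
}
```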

paterczm commented 8 years ago

> Perhaps we could run an initial query to collect all the identity field values, split by the batch size, then query for each set of identities and update accordingly?

Query for each set of identities and include a lastModifiedDate condition as well, to avoid processing rows changed since the initial query was run; those will be picked up by a future generated job. We would also need to make sure the batch queries are not executed in the same transaction.
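A rough sketch of how the per-batch processing could look under that refinement. Source, Destination, Row, fetchUnchangedSince, and overwrite are hypothetical placeholders standing in for the migrator's actual data access, and the lastModifiedDate guard is expressed as a cutoff timestamp taken when the initial identity query ran.

```java
import java.time.Instant;
import java.util.List;

public class BatchedMigration {

    /** Hypothetical source-access interface; the real migrator API may differ. */
    public interface Source {
        /**
         * Fetches rows for the given identities that have NOT been modified
         * after the cutoff; rows touched later are skipped here and left
         * for a future consistency-checker job.
         */
        List<Row> fetchUnchangedSince(List<String> identityBatch, Instant cutoff);
    }

    /** Hypothetical destination-access interface. */
    public interface Destination {
        void overwrite(List<Row> rows);
    }

    public static class Row { /* identity + payload fields omitted */ }

    /**
     * Processes each identity batch as its own unit of work. No transaction
     * spans multiple batches, so a long run does not hold locks open and a
     * failure in one batch does not roll back the others.
     */
    public static void migrate(Source source, Destination dest,
                               List<List<String>> identityBatches, Instant cutoff) {
        for (List<String> batch : identityBatches) {
            List<Row> rows = source.fetchUnchangedSince(batch, cutoff);
            dest.overwrite(rows);
        }
    }
}
```

Here the cutoff is the moment the initial identity query ran, so anything modified afterwards is intentionally ignored and handled by a later generated job, as described above.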