sunitparekh / data-anonymization

Want to use production data for testing? data-anonymization can help you.
MIT License

Wrap updates in transaction to avoid one BEGIN/COMMIT per row #56

Open JasonBarnabe opened 6 years ago

JasonBarnabe commented 6 years ago

Given n rows processed in m batches, the DB currently receives 3n statements for updates: a BEGIN, an UPDATE, and a COMMIT for every row. If each batch's updates were wrapped in a single transaction, it would receive only n + 2m statements.
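To make the arithmetic concrete, here is a rough sketch in plain Ruby that only counts statements; it does not touch a database or the gem's code. The helper names are hypothetical, and the sketch assumes one BEGIN/COMMIT pair per batch, with the last batch possibly being partial.

```ruby
# Hypothetical helpers (not part of data-anonymization): they count the SQL
# statements sent for n updated rows under each strategy.

# One implicit transaction per row: BEGIN, UPDATE, COMMIT for every row.
def statements_per_row_transactions(n)
  3 * n
end

# One transaction per batch: n UPDATEs plus one BEGIN and one COMMIT per batch.
def statements_per_batch_transactions(n, batch_size)
  m = (n.to_f / batch_size).ceil # number of batches, counting a partial last batch
  n + 2 * m
end

n = 40_000
batch_size = 1_000

per_row = statements_per_row_transactions(n)
per_batch = statements_per_batch_transactions(n, batch_size)

puts "per-row transactions:   #{per_row} statements"   # 120000
puts "per-batch transactions: #{per_batch} statements" # 40080
```

With 40000 rows and a batch size of 1000 (m = 40), the count drops from 120000 statements to 40080.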

Measured on a local Postgres table with 40000 rows, batch size 1000, anonymizing a single email field:

Before changes: 2m 52s
With transactions: 2m 26s (15% faster)

I'm not sure if this would have undesired effects for others, so maybe this should be configurable?

coveralls commented 6 years ago

Coverage decreased (-2.3%) to 91.541% when pulling d4f1c305fd747a8125e0b050a9cc5ced9947ec40 on kickbooster:transactions into db4f509dd9448fb2cfd25e4bb15c3d9116daead0 on sunitparekh:master.

JasonBarnabe commented 6 years ago

Note that #57 avoids transactions altogether, which brings the number of statements down to n.