sunitparekh / data-anonymization

Want to use production data for testing? data-anonymization can help you.
MIT License
459 stars, 92 forks

Run largest tables first when running in parallel #58

Closed by JasonBarnabe 6 years ago

JasonBarnabe commented 6 years ago

When using execution_strategy DataAnon::Parallel::Table, the tables will be processed in the order they are defined. If the number of tables is larger than the number of processes, then some tables have to wait for others to be completed.

We should try to make the table that takes longest to finish processing start processing first. This will make the entire run complete more quickly. In general, more rows = more time to process. So run the biggest tables first.
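The scheduling argument above can be sketched with a small simulation. This is an illustration only, not the code in this pull request: `TableJob`, `largest_first`, and `simulated_makespan` are hypothetical names, the row counts are made up, and processing time is assumed to be proportional to row count.

```ruby
# Hypothetical table descriptor: a name plus a row count obtained elsewhere
# (for example from a COUNT(*) query). Not part of data-anonymization.
TableJob = Struct.new(:name, :row_count)

# Return the jobs ordered so the largest tables come first.
def largest_first(jobs)
  jobs.sort_by { |job| -job.row_count }
end

# Simulate a fixed pool of workers taking jobs in the given order.
# Each job is assumed to cost time proportional to its row count.
# Returns the time at which the last worker finishes (the makespan).
def simulated_makespan(jobs, worker_count)
  finish_times = Array.new(worker_count, 0)
  jobs.each do |job|
    # The next job goes to whichever worker becomes free first.
    idx = finish_times.each_with_index.min_by { |time, _| time }.last
    finish_times[idx] += job.row_count
  end
  finish_times.max
end

jobs = [
  TableJob.new("users", 100),
  TableJob.new("orders", 200),
  TableJob.new("audit_log", 1000)
]

puts simulated_makespan(jobs, 2)                 # defined order: 1100
puts simulated_makespan(largest_first(jobs), 2)  # largest first: 1000
```

With two workers and the defined order, the big table starts only after a small one finishes, so the run takes 1100 units. Starting the big table first lets the two small tables share the other worker, and the run takes 1000 units.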

coveralls commented 6 years ago


Coverage remained the same at 93.794% when pulling a6f28eefe314b7dd4d092cca2ce83e4ee6336496 on kickbooster:parallel-order into db4f509dd9448fb2cfd25e4bb15c3d9116daead0 on sunitparekh:master.

sunitparekh commented 6 years ago

I like your idea. However, the tool needs to give a predictable sequence of table execution because of constraints such as foreign keys in the database, so changing the table order might create problems. The idea is to let users define the order in which they want the tables processed.

JasonBarnabe commented 6 years ago

If you are using execution_strategy DataAnon::Parallel::Table, you are not guaranteed a predictable execution sequence anyway.

sunitparekh commented 6 years ago

I understand your point that running in parallel might cause similar problems. However, my thought is that users at least have the option to reorder their tables in the DSL as needed. For example, you could change the order of the tables in the DSL based on a table-size calculation done outside the tool.
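One way a user might do that reordering themselves is sketched below, under stated assumptions. The row counts are hypothetical and would come from a query run outside the tool. The `table` method here is only a stand-in that records the declaration order so the sketch runs on its own. In a real script the gem's own `table` call would be used instead, and any foreign key constraints would still need to be respected.

```ruby
# Hypothetical row counts, computed outside the tool (e.g. via COUNT(*)).
ROW_COUNTS = { "users" => 100, "orders" => 200, "audit_log" => 1000 }

# Table definitions kept as blocks keyed by table name, so their order can
# be decided at load time. The empty block bodies are placeholders for the
# per-table anonymization rules.
TABLE_DEFINITIONS = {
  "users"     => proc { },
  "orders"    => proc { },
  "audit_log" => proc { }
}

# Stand-in for the DSL's `table` method (an assumption for this sketch);
# it only records the order in which tables are declared.
DECLARED = []
def table(name, &block)
  DECLARED << name
end

# Declare the tables largest first, so the longest-running table starts
# as early as possible under a parallel execution strategy.
TABLE_DEFINITIONS
  .sort_by { |name, _| -ROW_COUNTS.fetch(name, 0) }
  .each { |name, definition| table(name, &definition) }

puts DECLARED.inspect  # ["audit_log", "orders", "users"]
```

This keeps the ordering decision in the user's script, which matches the preference for leaving control with users rather than having the tool reorder tables on its own.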

I am not very comfortable changing the order based on table size and taking control away from users. Let me think it through and bounce the idea off a few colleagues in the office.

On Fri, 30 Mar 2018 at 20:30 Jason Barnabe notifications@github.com wrote:

If you are using execution_strategy DataAnon::Parallel::Table, you are not guaranteed a predictable execution sequence anyway.
