smithoss / gonymizer

Gonymizer: A Tool to Anonymize Sensitive PostgreSQL Data Tables for Use in QA and Testing
Apache License 2.0

Mechanism for reducing dump size by limiting rows per table? #52

Open mateodelnorte opened 5 years ago

mateodelnorte commented 5 years ago

Thanks for this tool. It looks pretty great.

I'd like to anonymize my data and also reduce the overall size of the database. Is there a mechanism that would let me specify a maximum number of records for a particular table and delete any records prior to that maximum set?

junkert commented 5 years ago

Thanks @mateodelnorte for the question and request.

Unfortunately, there is currently no way to do this, but it would be rather easy to modify the generator to handle row counts when processing the dump file. I'll see if I can find the time in the next couple of weeks to add this feature. I know we will need this eventually here at SmithRx as well.
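As a rough illustration of what row limiting during dump processing could look like, here is a minimal Python sketch. It is not Gonymizer code. It assumes a plain-text pg_dump file in which each table's data sits between a `COPY ... FROM stdin;` header and a `\.` terminator line, and that table names contain no spaces. The function name `limit_copy_rows` and the `max_rows` mapping are hypothetical.

```python
import re

# Header line that precedes a table's data in a plain-text pg_dump file.
# Assumption: the table name contains no whitespace (no quoted names with spaces).
COPY_HEADER = re.compile(r"^COPY (\S+) .*FROM stdin;$")


def limit_copy_rows(lines, max_rows):
    """Yield dump lines, keeping at most max_rows[table] data rows per COPY block.

    Tables that are not listed in max_rows are passed through unchanged.
    """
    table = None  # table whose COPY block we are currently inside, or None
    kept = 0      # data rows emitted so far for the current block
    for line in lines:
        text = line.rstrip("\n")
        if table is None:
            match = COPY_HEADER.match(text)
            if match:
                table, kept = match.group(1), 0
            yield line
        elif text == "\\.":
            # End-of-data marker: always keep it so the dump stays loadable.
            table = None
            yield line
        else:
            limit = max_rows.get(table)
            if limit is None or kept < limit:
                kept += 1
                yield line
```

This keeps the first rows in dump order and drops the rest, so it pays no attention to which rows other tables still reference.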

On another note, you can also reduce the size of your database by using the --exclude-table and --exclude-table-data options, which allow you to exclude full tables (do not include DDL) or just the table data (keep DDL, but ignore data in the table) from the dump process.

For example, we have a table that is denormalized when first added to our database. This table is very large and sparse until we process it and normalize the records. We choose to ignore this table's data during the dump process since we only care about the normalized data when testing.

junkert commented 4 years ago

@mateodelnorte looking into implementing this soon. Does the solution described above work for your use?

If we limit tables by size then we will not be able to keep foreign keys consistent between tables. If you do not care about foreign keys existing then the solution above should work.
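To make the foreign key problem concrete, here is a small Python sketch using in-memory rows rather than a real dump. The table names, column names, and the helper `orphaned_rows` are all hypothetical and only serve to illustrate the point.

```python
def orphaned_rows(child_rows, fk_column, parent_ids):
    """Return child rows whose foreign key points at a parent row that was dropped.

    Rows with a NULL (None) foreign key are not counted as orphans.
    """
    return [
        row for row in child_rows
        if row[fk_column] is not None and row[fk_column] not in parent_ids
    ]


users = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [{"id": 10, "user_id": 3}, {"id": 11, "user_id": 1}]

# Naive per-table limit: keep only the first two users.
kept_users = users[:2]
kept_ids = {user["id"] for user in kept_users}

# Order 10 still references user 3, which no longer exists.
print(orphaned_rows(orders, "user_id", kept_ids))
```

Loading such a trimmed dump would fail on the foreign key constraint unless the constraint is dropped or the orphaned rows are removed as well.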

If we want to ensure foreign key consistency with size limiting we will need to rewrite a large portion of the generator which may take a lot of time and would probably require a new product version release.
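One possible shape for foreign-key-aware limiting is sketched below in Python. This is an assumption about how it could work, not a description of the generator. It operates on in-memory rows, expects tables to be supplied parents first, and does not handle self-references or circular foreign keys. The function `trim_with_foreign_keys` and its argument layout are hypothetical.

```python
def trim_with_foreign_keys(tables, foreign_keys, limits):
    """Trim tables to their row limits while dropping rows that would be orphaned.

    tables:       dict of table name -> list of row dicts, parents listed first
    foreign_keys: dict of child table -> list of (column, parent_table, parent_column)
    limits:       dict of table name -> maximum number of rows to keep
    """
    kept = {}
    for name, rows in tables.items():
        refs = foreign_keys.get(name, [])
        # Key values that still exist in each referenced parent after trimming.
        allowed = {
            (col, parent, pcol): {p[pcol] for p in kept[parent]}
            for col, parent, pcol in refs
        }
        # Keep only rows whose every foreign key is NULL or still resolvable.
        survivors = [
            row for row in rows
            if all(
                row[col] is None or row[col] in allowed[(col, parent, pcol)]
                for col, parent, pcol in refs
            )
        ]
        limit = limits.get(name)
        kept[name] = survivors if limit is None else survivors[:limit]
    return kept


tables = {
    "users": [{"id": 1}, {"id": 2}, {"id": 3}],
    "orders": [
        {"id": 10, "user_id": 3},
        {"id": 11, "user_id": 1},
        {"id": 12, "user_id": 2},
    ],
}
foreign_keys = {"orders": [("user_id", "users", "id")]}
trimmed = trim_with_foreign_keys(tables, foreign_keys, {"users": 2})
```

Because a child table can only be filtered after its parents have been trimmed, a streaming pass over a dump file would need the tables in dependency order or a second pass, which hints at why this is a larger change than a simple per-table row cap.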

mateodelnorte commented 4 years ago

Thanks @junkert. I actually created a simple db-trim tool to do the same as is suggested above. Would be happy to use it as a part of this tool and not have to maintain mine. Overall, something that checks referential integrity would also be great. But, I'm sure you're thinking of that as well and recognize the increased complexity.