smithoss / gonymizer

Gonymizer: A Tool to Anonymize Sensitive PostgreSQL Data Tables for Use in QA and Testing
Apache License 2.0

Disk space usage #100

Open mcg opened 3 years ago

mcg commented 3 years ago

Building out a POC for possibly using gonymizer. It appears that a dump/process/upload run requires roughly three times the database's disk space: storage for the original dump, for the intermediary partial files, and for the resultant file as the partials are combined.

Is this correct, and is there any way to avoid using this much storage?

junkert commented 12 months ago

There definitely is, but it would require a major refactor to the application.

The project was built against a smaller (< 100GB) database, so we did not build space constraints into the design of this application. One of our objectives was to anonymize the database only through files (outside the DB) and then make it easy to copy the result wherever we liked (laptop, staging, etc.).

A common design you will see elsewhere for anonymization is to anonymize the data inside the database where the real data lives, dump only the temporary anonymized tables to a file, and finally drop those temporary tables afterward. This method, however, creates load, can take up to 2x the space inside the database, and could severely impact database performance on the main host (depending on hardware). It is possible to offload this extra load and space to a replica instead.
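For reference, that in-database pattern typically looks something like the following illustrative PostgreSQL (table and column names are invented for the example; this is not something gonymizer does today):

```sql
-- Build a temporary anonymized copy next to the real data
-- (this is where the extra in-database space goes).
CREATE TABLE users_anon AS
SELECT id,
       md5(email) || '@example.com' AS email,     -- fake the sensitive columns
       'REDACTED'                   AS full_name,
       created_at                                 -- keep non-sensitive columns as-is
FROM users;

-- Dump only the anonymized copy, e.g.:
--   pg_dump --table=users_anon mydb > users_anon.sql
-- then clean up:
DROP TABLE users_anon;
```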

One way I think we could improve disk space usage, at the cost of some CPU, is to add an option to compress all input and output files as they are written to disk and decompress them every time we read from disk. I feel this could be an improvement without having to redesign the application.

How big is the database you are looking to anonymize?