Broken pipe when using large dump files

rwnx / pynonymizer

A universal tool for translating sensitive production database dumps into anonymized copies.

https://pypi.org/project/pynonymizer/

MIT License


Broken pipe when using large dump files #95

Status: Closed (closed by stoiven 2 years ago)

stoiven commented 2 years ago

Describe the bug: Running pynonymizer on a large dump file (e.g. >15 GB) crashes.

To Reproduce: Run the program with a large dump file. It outputs the following:

  File "/home/ubuntu/.local/lib/python3.8/site-packages/pynonymizer/pynonymize.py", line 140, in pynonymize
    db_provider.restore_database(input_path)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pynonymizer/database/postgres/__init__.py", line 199, in restore_database
    batch_processor.close()

Expected behavior: The restore should complete normally. It works fine with a file smaller than 5 GB.

rwnx commented 2 years ago

Hi! Are you sure this works as expected when you restore the dump with mysql manually? pynonymizer really isn't doing much other than piping the dumpfile into the mysql binary.

Normally when you get a broken pipe on the mysql cli it relates to the server's ability to handle large statements/a lot of data at once, e.g. max_allowed_packet or similar settings.

stoiven commented 2 years ago

Ah, you're right! It was definitely the input! The dump was actually a PostgreSQL custom-format dump (binary), so it couldn't be read properly. Sorry about that!

Additionally, I don't see an option to read from a custom-format dump. I tried converting the file over to a .gz extension, but it still fails with the same error as above. The file is in the custom format, and I'm not sure whether this is a pynonymizer limitation or whether there are special flags I need to pass.
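
For anyone hitting the same thing: archives written by pg_dump in custom format begin with the ASCII signature PGDMP, so a quick look at the first bytes tells you whether a file is a binary archive rather than plain SQL. The snippet below is only a sketch; the function name is hypothetical and not part of pynonymizer, and the gzip branch assumes the file was really gzip-compressed rather than just renamed.

```python
import gzip

PG_CUSTOM_MAGIC = b"PGDMP"   # signature at the start of pg_dump custom-format archives
GZIP_MAGIC = b"\x1f\x8b"     # first two bytes of any gzip stream


def looks_like_pg_custom_dump(path):
    """Return True if the file (optionally gzip-compressed) starts with PGDMP.

    Hypothetical helper for illustration; not part of pynonymizer.
    """
    with open(path, "rb") as f:
        head = f.read(len(PG_CUSTOM_MAGIC))
    if head[:2] == GZIP_MAGIC:
        # Peek inside the compressed stream instead of at the gzip header.
        with gzip.open(path, "rb") as f:
            head = f.read(len(PG_CUSTOM_MAGIC))
    return head == PG_CUSTOM_MAGIC
```

If this returns True, the file is not something psql can read directly, no matter what the extension says.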

rwnx commented 2 years ago

pynonymizer is written for logical (plain SQL) database dumps that are piped into the CLI SQL runner (for postgres, psql). If this is something we could be supporting, I'd be interested to hear a feature request!
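
In the meantime, a possible workaround outside pynonymizer: pg_restore writes a plain SQL script when given -f and no target database, so a custom-format archive can be converted first and the resulting file used as input. This is only a sketch; the helper names are hypothetical, and it assumes pg_restore is on the PATH and compatible with the archive.

```python
import subprocess


def build_convert_command(custom_dump, sql_out):
    """Build a pg_restore invocation that emits a plain SQL script.

    With -f and no -d, pg_restore writes SQL to the given file
    instead of restoring into a database.
    """
    return ["pg_restore", "-f", sql_out, custom_dump]


def convert_custom_dump(custom_dump, sql_out):
    """Run the conversion; raises CalledProcessError if pg_restore fails."""
    subprocess.run(build_convert_command(custom_dump, sql_out), check=True)
```

For example, convert_custom_dump("prod.dump", "prod.sql") would produce prod.sql, which could then be given to pynonymizer in place of the original archive. Note that the plain SQL script can be considerably larger than the compressed archive.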