rheinwerk-verlag / pganonymize

A commandline tool for anonymizing PostgreSQL databases
http://pganonymize.readthedocs.io/

Fix batch processing #26

Closed nurikk closed 3 years ago

nurikk commented 3 years ago

This PR fixes an issue with processing huge tables. It calls import_data inside a loop that calls fetchmany and clears the data and StringIO buffers after each iteration (see the sketch below).

Depends on https://github.com/rheinwerk-verlag/postgresql-anonymizer/pull/25
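
A minimal sketch of that batching pattern, assuming placeholder table/column names, connection parameters and batch size (this is not the project's actual import_data implementation):

# Minimal sketch of the batching idea described above (assumed names, not the
# project's actual import_data implementation): read rows in chunks with
# fetchmany(), stream each chunk through a fresh StringIO buffer via COPY,
# and drop the buffer before the next chunk so memory stays bounded.
import csv
import io

import psycopg2

BATCH_SIZE = 10000  # illustrative chunk size


def copy_in_batches(connection, source_table, target_table, columns):
    """Copy rows from source_table into target_table in fixed-size batches."""
    # A named (server-side) cursor avoids fetching the whole result set at once.
    with connection.cursor(name='batch_read') as read_cur, connection.cursor() as write_cur:
        read_cur.execute('SELECT {} FROM {}'.format(', '.join(columns), source_table))
        while True:
            rows = read_cur.fetchmany(BATCH_SIZE)
            if not rows:
                break
            buffer = io.StringIO()
            csv.writer(buffer).writerows(rows)
            buffer.seek(0)
            write_cur.copy_expert(
                'COPY {} ({}) FROM STDIN WITH CSV'.format(target_table, ', '.join(columns)),
                buffer,
            )
            buffer.close()  # release the per-batch buffer before fetching more rows


# Example usage with placeholder connection and table names:
conn = psycopg2.connect(host='localhost', dbname='example')
copy_in_batches(conn, 'customers', 'customers_tmp', ['id', 'email'])
conn.commit()
conn.close()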

hkage commented 3 years ago

Hi,

I tested your changes against a ~1.6 GB database with tables of up to ~370k rows each. The anonymization breaks at some point with an OutOfMemory error. Have you discovered similar issues?

I'm running the anonymization with Python 3.9 in a virtualenv on Ubuntu Linux 21.04 with 24 GB of RAM.

postgres=# SELECT version();
                                                           version                                                            
------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 10.7 (Ubuntu 10.7-0ubuntu0.18.10.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 8.2.0-7ubuntu1) 8.2.0, 64-bit
(1 row)
Anonymizing |█                               | 10000/292059
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/henning/Projekte/postgresql-anonymizer/pganonymizer/__main__.py", line 18, in <module>
    main()
  File "/home/henning/Projekte/postgresql-anonymizer/pganonymizer/__main__.py", line 10, in main
    main()
  File "/home/henning/Projekte/postgresql-anonymizer/pganonymizer/cli.py", line 71, in main
    anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose)
  File "/home/henning/Projekte/postgresql-anonymizer/pganonymizer/utils.py", line 40, in anonymize_tables
    build_and_then_import_data(connection, table_name, primary_key, columns, excludes,
  File "/home/henning/Projekte/postgresql-anonymizer/pganonymizer/utils.py", line 83, in build_and_then_import_data
    import_data(connection, columns, table, primary_key, data)
  File "/home/henning/Projekte/postgresql-anonymizer/pganonymizer/utils.py", line 158, in import_data
    cursor.execute('DROP TABLE %s;' % temp_table)
psycopg2.errors.OutOfMemory: out of shared memory
HINT:  You might need to increase max_locks_per_transaction.
nurikk commented 3 years ago

No, I'm not getting such issues. Without the proposed changes, I assume everything works fine?

hkage commented 3 years ago

Yes, as soon as I switch to the development branch, the anonymization works fine and completes after a couple of minutes.

I haven't taken a deeper look into the max_locks_per_transaction setting in PostgreSQL, but it seems like the number of locks acquired during the anonymization exceeds PostgreSQL's default. I tried to split the cursor operations into multiple transactions, but that resulted in errors when deleting the temporary tables (at https://github.com/rheinwerk-verlag/postgresql-anonymizer/blob/development/pganonymizer/utils.py#L157).
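
As a quick check while the anonymization runs, one can compare that setting with the number of locks the current backend holds; a hypothetical psycopg2 snippet (placeholder connection parameters, not part of the project):

# Hypothetical diagnostic: compare the configured max_locks_per_transaction
# with the number of locks the current backend currently holds.
import psycopg2

conn = psycopg2.connect(host='localhost', dbname='example')  # placeholder parameters
with conn.cursor() as cur:
    cur.execute('SHOW max_locks_per_transaction;')
    print('max_locks_per_transaction:', cur.fetchone()[0])

    # pg_locks lists all locks in the cluster; filter to this backend's pid.
    cur.execute('SELECT count(*) FROM pg_locks WHERE pid = pg_backend_pid();')
    print('locks held by this backend:', cur.fetchone()[0])
conn.close()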

nurikk commented 3 years ago

I'll fix this issue

hkage commented 3 years ago

I'll fix this issue

Thanks. I can support you with additional debug / system information if that helps you.

nurikk commented 3 years ago

Hm, I can't reproduce it on my test DB. Can you try to test against this PR? It already includes the changes from this PR and might resolve the psycopg2.errors.OutOfMemory: out of shared memory issue.

nurikk commented 3 years ago

Superseded by https://github.com/rheinwerk-verlag/postgresql-anonymizer/pull/27

hkage commented 3 years ago

Hm, I can't reproduce it on my test DB. Can you try to test against this PR? It already includes the changes from this PR and might resolve the psycopg2.errors.OutOfMemory: out of shared memory issue.

Sure, I will check it with PR 27. Thanks!