rwnx / pynonymizer

A universal tool for translating sensitive production database dumps into anonymized copies.
https://pypi.org/project/pynonymizer/
MIT License

Anonymizer process control does not work #97

Closed armorKing11 closed 2 years ago

armorKing11 commented 2 years ago

Describe the bug

When I run pynonymizer and rely on its process control system, it does not follow the default process control flow. As I understand pynonymize.py, it is supposed to go through the flow below, starting with CREATE_DB:

    logger.info(actions.summary(ProcessSteps.CREATE_DB))
    if not actions.skipped(ProcessSteps.CREATE_DB):
        db_provider.create_database()

    logger.info(actions.summary(ProcessSteps.RESTORE_DB))
    if not actions.skipped(ProcessSteps.RESTORE_DB):
        db_provider.restore_database(input_path)

    logger.info(actions.summary(ProcessSteps.ANONYMIZE_DB))
    if not actions.skipped(ProcessSteps.ANONYMIZE_DB):
        db_provider.anonymize_database(strategy)

    logger.info(actions.summary(ProcessSteps.DUMP_DB))
    if not actions.skipped(ProcessSteps.DUMP_DB):
        db_provider.dump_database(output_path)

    logger.info(actions.summary(ProcessSteps.DROP_DB))
    if not actions.skipped(ProcessSteps.DROP_DB):
        db_provider.drop_database()

In practice, when I run it as follows, it does not create the database and fails.

To Reproduce

Issue1:

    import pynonymizer

    pynonymizer.run(
        input_path="main_sys.sql",
        strategyfile_path="strategy_file1.yaml",
        db_host="< host >",
        db_name="main_sys",
        db_password="<password>",
        output_path="main_sys_anonymized.sql",
    )

Does this imply that it did not run CREATE_DB by default and instead ran RESTORE_DB first? The logs show the restore, followed by "Table 'main_sys.admins' doesn't exist". Because of this, I tried to start explicitly from the CREATE_DB step, as shown in Issue2 below.

Error log:

mysql: [Warning] Using a password on the command line interface can be insecure.
Restoring: 100%|██████████| 233k/233k [00:00<00:00, 658kB/s]
["UPDATE `user` SET `first_name` = ('hello'),`last_name` = ('test');"]
["UPDATE `user` SET `first_name` = ('hello'),`last_name` = ('test');"]
Anonymizing user:   0%|          | 0/1 [00:00<?, ?it/s]    mysql: [Warning] Using a password on the command line interface can be insecure.
ERROR 1146 (42S02) at line 1: Table 'main_sys.admins' doesn't exist
Anonymizing user:   0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/test/Documents/tools/pnonymizer/mask.py", line 3, in <module>
    pynonymizer.run(input_path="main_sys.sql", strategyfile_path="strategy.yaml",
  File "/Users/test/Documents/DataProcessor/venv/lib/python3.9/site-packages/pynonymizer/pynonymize.py", line 147, in pynonymize
    db_provider.anonymize_database(strategy)
  File "/Users/test/Documents/venv/lib/python3.9/site-packages/pynonymizer/database/mysql/__init__.py", line 159, in anonymize_database
    self.__runner.db_execute(statements)
  File "/Users/test/Documents/venv/lib/python3.9/site-packages/pynonymizer/database/mysql/execution.py", line 131, in db_execute
    self.__mask_subprocess_error(error)
  File "/Users/test/Documents/venv/lib/python3.9/site-packages/pynonymizer/database/mysql/execution.py", line 81, in __mask_subprocess_error
    raise error from None
  File "/Users/test/Documents/venv/lib/python3.9/site-packages/pynonymizer/database/mysql/execution.py", line 124, in db_execute
    subprocess.check_output(
  File "/usr/local/Cellar/python@3.9/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/local/Cellar/python@3.9/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['mysql', '-h', '127.0.0.1', '-P', '3306', '-u', 'test', '-p******']' returned non-zero exit status 1.

Issue2:

    pynonymizer.run(
        input_path="main_sys.sql",
        strategyfile_path="strategy_file1.yaml",
        db_host="< host >",
        db_name="main_sys",
        db_password="<password>",
        output_path="main_sys_anonymized.sql",
        start_at_step="CREATE_DB",
    )

I passed start_at_step='CREATE_DB' to pynonymizer.run() to understand and change the process control behaviour, making sure the database gets created so that the error above is avoided. The error below still occurs, which suggests that it runs RESTORE_DB and then CREATE_DB, even though CREATE_DB is supposed to run first.

Error log:

Restoring: 100%|██████████| 307k/307k [00:00<00:00, 1.13MB/s]
Anonymizing user:   0%|          | 0/1 [00:00<?, ?it/s]    mysql: [Warning] Using a password on the command line interface can be insecure
["UPDATE `user` SET `last_name` = ( 'test' );"]
ERROR 1146 (42S02) at line 1: Table 'main_sys.admins' doesn't exist
Anonymizing user:   0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
    subprocess.check_output(
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['mysql', '-h', '127.0.0.1', '-P', '3306', '-u', 'root', '-p******']' returned non-zero exit status 1.

Can you please advise how I can achieve the basic process flow via the Python script, i.e.:

1. CREATE_DB
2. RESTORE_DB
3. ANONYMIZE_DB
4. DUMP_DB

Thank you @rwnx. The only way I am able to use the tool step by step is to call pynonymizer.run once per step, setting only_step to each of the following values in turn: CREATE_DB, RESTORE_DB, ANONYMIZE_DB, DUMP_DB.
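The per-step workaround described here could be wrapped in a small helper. This is only a sketch: `run_steps_individually` is a hypothetical name, the `run` argument is meant to be `pynonymizer.run`, and it assumes that `run` accepts the same keyword arguments for every step, as in the calls shown in this issue.

```python
from typing import Any, Callable, List

# Step names as listed in this issue, in the order the reporter runs them.
STEPS = ("CREATE_DB", "RESTORE_DB", "ANONYMIZE_DB", "DUMP_DB")


def run_steps_individually(run: Callable[..., Any], **run_kwargs: Any) -> List[str]:
    """Call `run` once per step, passing only_step each time.

    `run` is expected to be pynonymizer.run. It is passed in as an argument
    so this (hypothetical) helper can be exercised without a database.
    """
    completed = []
    for step in STEPS:
        # A failure in any step raises and stops the loop before later steps.
        run(only_step=step, **run_kwargs)
        completed.append(step)
    return completed


# Intended usage (not executed here, since it needs a reachable database):
#   import pynonymizer
#   run_steps_individually(
#       pynonymizer.run,
#       input_path="main_sys.sql",
#       strategyfile_path="strategy_file1.yaml",
#       db_name="main_sys",
#       output_path="main_sys_anonymized.sql",
#   )
```

Because each step is a separate call, the step that fails is visible from the position of the exception rather than from the log output alone.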

Expected behavior

As per the documentation and code, it should go through the steps in the order below as its default process control behaviour, i.e.:

    logger.info(actions.summary(ProcessSteps.CREATE_DB))
    if not actions.skipped(ProcessSteps.CREATE_DB):
        db_provider.create_database()

    logger.info(actions.summary(ProcessSteps.RESTORE_DB))
    if not actions.skipped(ProcessSteps.RESTORE_DB):
        db_provider.restore_database(input_path)

    logger.info(actions.summary(ProcessSteps.ANONYMIZE_DB))
    if not actions.skipped(ProcessSteps.ANONYMIZE_DB):
        db_provider.anonymize_database(strategy)

    logger.info(actions.summary(ProcessSteps.DUMP_DB))
    if not actions.skipped(ProcessSteps.DUMP_DB):
        db_provider.dump_database(output_path)

    logger.info(actions.summary(ProcessSteps.DROP_DB))
    if not actions.skipped(ProcessSteps.DROP_DB):
        db_provider.drop_database()
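To make the expected ordering concrete, here is a toy model of how step selection could behave. It is not pynonymizer's actual implementation: `Step`, `is_skipped` and `planned_steps` are hypothetical names, and the only options modelled are start_at_step and only_step, the two mentioned in this issue.

```python
from enum import IntEnum
from typing import List, Optional


class Step(IntEnum):
    """Steps in the order they appear in the snippet above."""
    CREATE_DB = 1
    RESTORE_DB = 2
    ANONYMIZE_DB = 3
    DUMP_DB = 4
    DROP_DB = 5


def is_skipped(step: Step,
               start_at_step: Optional[Step] = None,
               only_step: Optional[Step] = None) -> bool:
    """Assumed semantics: only_step wins; otherwise skip everything before start_at_step."""
    if only_step is not None:
        return step != only_step
    if start_at_step is not None:
        return step < start_at_step
    return False


def planned_steps(start_at_step: Optional[Step] = None,
                  only_step: Optional[Step] = None) -> List[str]:
    """Names of the steps that would run, in execution order."""
    return [s.name for s in Step if not is_skipped(s, start_at_step, only_step)]
```

Under this reading, passing start_at_step='CREATE_DB' selects the same steps as passing nothing at all, so it would not be expected to change the outcome of Issue1.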

Additional context

rwnx commented 2 years ago

However, there's no reason to think it didn't create the database; it just doesn't log anything when it does. Remember that the schema for the database comes from the dumpfile, not from the CREATE_DB step.

I understand that the table doesn't exist at anonymization time, though. If I were looking into this, I'd start by making absolutely sure that the table in question is in the dumpfile/SQL and that the strategyfile references it correctly, including any options such as schema.

To be clear, have you tried restoring this dumpfile manually, and does that work (i.e. is the table present there)?
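One rough way to run the first of those checks without touching the database is to scan the dumpfile for CREATE TABLE statements and compare them with the tables the strategyfile names. The sketch below assumes a plain-text SQL dump and a table list typed in by hand; `tables_in_dump` and `missing_tables` are hypothetical helpers, and the regex is a heuristic rather than a SQL parser.

```python
import re
from typing import Iterable, List, Set

# Matches: CREATE TABLE [IF NOT EXISTS] [schema.]table, with optional ` or " quoting.
CREATE_TABLE_RE = re.compile(
    r'CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?'
    r'(?:[`"]?\w+[`"]?\.)?'   # optional schema qualifier, e.g. main_sys.
    r'[`"]?(\w+)[`"]?',       # the table name itself
    re.IGNORECASE,
)


def tables_in_dump(sql_text: str) -> Set[str]:
    """Return the table names that the dump creates."""
    return {match.group(1) for match in CREATE_TABLE_RE.finditer(sql_text)}


def missing_tables(strategy_tables: Iterable[str], sql_text: str) -> List[str]:
    """Return tables referenced by the strategy that the dump never creates."""
    present = tables_in_dump(sql_text)
    return sorted(t for t in strategy_tables if t not in present)


# Intended usage, with file and table names taken from this issue:
#   with open("main_sys.sql", encoding="utf-8") as dump:
#       print(missing_tables(["user", "admins"], dump.read()))
```

A non-empty result would point at the dumpfile or the strategyfile rather than at process control.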

armorKing11 commented 2 years ago

pynonymizer works correctly if you use the only_step parameter in the run() call, with each step specified individually. That is my current workaround, i.e.:

1. CREATE_DB
2. RESTORE_DB
3. ANONYMIZE_DB
4. DUMP_DB

What it does not appear to do is go through its default process control, as specified in the README and code and as I mentioned in my comments above. Over the weekend I will look into enabling a logger to access the pynonymizer logs, hopefully at a higher verbosity level, to investigate the issue further. Thanks for the quick response @rwnx.
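For the logging part, a minimal sketch using only the standard library is shown below. It assumes pynonymizer's modules create their loggers under the "pynonymizer" namespace, which is the usual result of `logging.getLogger(__name__)` but is not confirmed in this thread; `enable_pynonymizer_logging` is a hypothetical helper name.

```python
import logging


def enable_pynonymizer_logging(level: int = logging.INFO) -> logging.Logger:
    """Attach a console handler (if none exists) and set the package log level."""
    # basicConfig does nothing if the root logger already has handlers.
    logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s %(message)s")
    # Assumption: child loggers such as "pynonymizer.pynonymize" inherit this level.
    logger = logging.getLogger("pynonymizer")
    logger.setLevel(level)
    return logger


# Call this before pynonymizer.run(...), for example:
#   enable_pynonymizer_logging(logging.DEBUG)
```

Since the step summaries in the snippet quoted earlier are emitted with logger.info, INFO should be enough to see which steps run and which are skipped.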

rwnx commented 2 years ago

At the moment, to go further on this I think we need to put together a replication case we can test against.

The information provided here would indicate something seriously wrong with the normal flow of the tool, which we definitely don't want, but it also doesn't seem to happen in any of our tests.

Can you give any more info about your use case?

How big is the dumpfile you're restoring? (My thinking is it might be related to the fix in #98.) If you try against the current master, that could be useful. Alternatively, wait for this change to ship in the next release!

armorKing11 commented 2 years ago

The dump file I am restoring is only about 50 MB, and I was using the code in master. However, looking at the version (I am using 1.21.3), I assume the #98 fix is available in 1.22.0? The release history at https://pypi.org/project/pynonymizer/1.22.0/#history does not tell me which fixes are included in that release. Can you please confirm whether the fix is present in the 1.22.0 release, @rwnx?

rwnx commented 2 years ago

Hi, yes, that's present in v1.22.0.

If you want to know what's in each release, you can check out the CHANGELOG or compare git tags.
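For anyone following along, whether an installed copy is at or past v1.22.0 can be checked with a short comparison. This sketch assumes plain numeric versions such as 1.21.3 (no pre-release suffixes); `version_tuple` and `includes_fix` are hypothetical helpers.

```python
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '1.21.3' into (1, 21, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split(".")[:3])


def includes_fix(installed: str, fixed_in: str = "1.22.0") -> bool:
    """True if the installed version is at or after the release carrying the fix."""
    return version_tuple(installed) >= version_tuple(fixed_in)


# To check the environment that is actually in use (Python 3.8+):
#   from importlib.metadata import version
#   print(includes_fix(version("pynonymizer")))
```

Comparing tuples of integers avoids the string-comparison trap where "1.9.0" would sort after "1.22.0".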

armorKing11 commented 2 years ago

Thanks @rwnx !!

rwnx commented 2 years ago

I'm assuming this was fixed for you, so am closing this issue. Let me know if that's not the case.