rwnx / pynonymizer

A universal tool for translating sensitive production database dumps into anonymized copies.
https://pypi.org/project/pynonymizer/
MIT License
101 stars 38 forks

MSSQL command timeout parameter #168

Closed jonstace-hsf closed 4 weeks ago

jonstace-hsf commented 2 months ago

I was having timeout issues when anonymizing larger tables (~30M rows) with MSSQL. So I've implemented a new --mssql-timeout parameter that sets the command timeout for the UPDATE statements that run when masking data.

I've done this as an MSSQL-specific parameter, mainly because I don't have a decent test setup for MySQL or Postgres with large volumes of data. The implementation also relies on the way PyODBC picks up the 'timeout' property of the Connection when executing SQL, so it wouldn't work for the other two database types.
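For anyone curious, the mechanism is roughly this (a sketch; the helper name is illustrative, not pynonymizer's actual code): pyodbc exposes a per-connection `timeout` attribute, in seconds, which the driver applies to every subsequent statement execution.

```python
# Sketch of what --mssql-timeout effectively does (helper name is
# hypothetical): pyodbc applies Connection.timeout to every statement
# executed on that connection.
def apply_command_timeout(connection, seconds):
    # 0 (pyodbc's default) means no limit; any other value makes a
    # long-running UPDATE fail with OperationalError once exceeded
    connection.timeout = seconds
    return connection

# usage (connection string is an example):
# import pyodbc
# conn = apply_command_timeout(pyodbc.connect("DSN=mssql;..."), 600)
```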

I hope this is useful to other people!

rwnx commented 1 month ago

hey, thanks for the PR, this seems useful. Is it possible to set this via the connection string overrides, or does it have to be a separate parameter?

jonstace-hsf commented 1 month ago

Hi, I spent quite some time trying to get this working without changing the Pynonymizer code, but the MSSQL ODBC driver doesn't have a connection-string parameter for command timeout. There's one for connection timeout, but that didn't affect SQL execution at all; it's just the timeout for opening the connection.

rwnx commented 1 month ago

Thanks for the context and the investigation. I'm going to look into whether we can apply this to the other providers too, but I think this is a great feature.

I want to think carefully before introducing a new provider-specific feature.

I am thinking of making it general and using this sort of thing for other providers:

https://stackoverflow.com/questions/48407513/how-to-set-mysql-session-wait-timeout-from-the-command-line
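The linked answer boils down to something like this (a sketch only; host, credentials and the value are examples):

```shell
# Apply a session variable before any other statement on the connection
mysql --init-command="SET SESSION wait_timeout=28800" -h mysql -u root -p so
```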

jonstace-hsf commented 1 month ago

Hi there. Yes, it would make the most sense to have a consistent parameter that applies across the database types. This weekend, I had a go at setting up a similar-sized dataset on MySQL and Postgres to test how they behave. I've used the Comments.xml file from a StackOverflow data dump, which has about 90M records. I've done a very simple strategy file (below) and I'm just running the ANONYMIZE_DB stage.

tables:
  comments:
    columns:
      text: paragraph

I've run this all through Docker, with the latest releases of each bit of software. Here are the results for MySQL. I left it running overnight, so I'm not sure exactly how long it ran for, but the error didn't appear quickly; it had been going for well over 10 minutes before I left it.

docker run --mount type=bind,source=./so.yaml,target=/tmp/so.yaml --network datanet rwnxt/pynonymizer -s /tmp/so.yaml --db-host mysql --db-name so --db-user root --db-password XXX --only-step ANONYMIZE_DB
Skipped [CREATE_DB]: (Starting at [ANONYMIZE_DB])
Skipped [RESTORE_DB]: (Starting at [ANONYMIZE_DB])
[ANONYMIZE_DB]
creating seed table with 1 columns
Inserting seed data
Inserting seed data: 100%|██████████| 150/150 [00:08<00:00, 17.14rows/s]
Anonymizing 1 tables
Anonymizing comments:   0%|          | 0/1 [00:00<?, ?it/s]ERROR 1206 (HY000) at line 1: The total number of locks exceeds the lock table size
Error while anonymizing table comments

Postgres was a lot happier:

docker run --mount type=bind,source=./so.yaml,target=/tmp/so.yaml --network datanet rwnxt/pynonymizer -s /tmp/so.yaml --db-host pg --db-name so --db-user postgres --db-password XXX --db-type postgres --only-step ANONYMIZE_DB
Skipped [CREATE_DB]: (Starting at [ANONYMIZE_DB])
Skipped [RESTORE_DB]: (Starting at [ANONYMIZE_DB])
[ANONYMIZE_DB]
creating seed table with 1 columns
Inserting seed data
Inserting seed data: 100%|██████████| 150/150 [00:08<00:00, 16.73rows/s]
Anonymizing 1 tables
Anonymizing comments: 100%|██████████| 1/1 [20:45<00:00, 1245.10s/it]
dropping seed table
Skipped [DUMP_DB]: (Stopped at [ANONYMIZE_DB])
Skipped [DROP_DB]: (Stopped at [ANONYMIZE_DB])
Process complete!

I'm currently running this same test on MSSQL to make sure I'm doing a fair comparison. I'll add an update once that's run.

From my results so far, it looks to me like MySQL would have completed if I'd used the right server settings (https://stackoverflow.com/a/10912622) rather than the defaults. Postgres was quite happy processing all that data, and didn't hit any similar query timeout or lock timeout.
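If I understand that answer correctly, the fix for the ERROR 1206 above is a server-side setting rather than anything client-side; roughly this in my.cnf (the value is only an example):

```ini
# InnoDB keeps its lock table inside the buffer pool, so a bigger pool
# raises the lock limit behind "total number of locks exceeds the lock
# table size" (example value, tune to the server's RAM)
[mysqld]
innodb_buffer_pool_size = 4G
```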

Looking at the help for MySQL's wait_timeout parameter (https://dev.mysql.com/doc/refman/8.4/en/server-system-variables.html#sysvar_wait_timeout), it doesn't appear to be the same thing as MSSQL's command timeout. wait_timeout is about idle connections, whereas MSSQL's command timeout is how long a running query is allowed to continue before being interrupted.

jonstace-hsf commented 1 month ago

Ok, a little later than intended... I got caught out by issue #167 for a bit, but I did a quick edit of that line of code in the docker container and was able to run this same test against MSSQL, all within the same environment. The results were not what I was expecting!

root@0a8218f806e3:~# pynonymizer -s /tmp/so.yaml --mssql-connection-string "server=mssql;database=so;UId=sa;Pwd=XXX;TrustServerCertificate=Yes" --db-type mssql --db-host mssql --only-step ANONYMIZE_DB
Skipped [CREATE_DB]: (Starting at [ANONYMIZE_DB])
Skipped [RESTORE_DB]: (Starting at [ANONYMIZE_DB])
[ANONYMIZE_DB]
creating seed table with 1 columns
Inserting seed data
Inserting seed data: 100%|██████████| 150/150 [00:01<00:00, 77.93rows/s]
Anonymizing 1 tables
Anonymizing comments: 100%|██████████| 1/1 [4:35:13<00:00, 16513.76s/it]
Dropping seed table
Skipped [DUMP_DB]: (Stopped at [ANONYMIZE_DB])
Skipped [DROP_DB]: (Stopped at [ANONYMIZE_DB])
Process complete!

I was really expecting this test to result in a query timeout, especially given how long it took to run. Now, this is SQL Server 2022 on Linux, whereas the problems I had were on SQL Server 2019 on Windows. I'm a little confused by this, as I've been using MSSQL professionally since v7 and query timeouts are a regular issue when running big processing queries. All of this testing was an attempt to prove that this is an MSSQL-specific issue.

Thinking about it, the job that brought up this issue in the first place was anonymising a number of tables with 1M+ rows, and not all of them would time out, so MSSQL wasn't behaving consistently even then. I'm going to bring that database back to the server where I originally saw the issue, to see if I can replicate it with this dataset.

jonstace-hsf commented 4 weeks ago

Following on from my previous messages, I've done more investigation into this. The StackOverflow dataset test on MSSQL/Windows completed without the query timeout error. I then went back to my original use case, anonymizing a set of audit tables. I was able to isolate those tables and reproduce the query timeout error consistently. BUT: when I moved them over to my MSSQL instance on Docker, I couldn't reproduce the issue.

The conclusion I've come to is that the query timeout is a 'feature' of the MSSQL ODBC driver on Windows, and it doesn't behave like-for-like in the official MSSQL ODBC driver on Linux. Microsoft don't appear to have released the source code for the ODBC driver, but they have for the MSSQL JDBC driver and the Microsoft.Data.SqlClient C# client library. Both of those implement a query timeout timer within the driver itself, so I'm confident it's all in the driver, not on the server.

Anyway, all that exposition leads up to my view that this feature is only needed for MSSQL, rather than across MySQL, Postgres and MSSQL. It probably only matters when running Pynonymizer on Windows, but I'd really appreciate it if this went into an official release so I can run my scripts on different servers without having to maintain a custom patched version.

rwnx commented 4 weeks ago

I'm happy with this as is, thank you so much for the detailed investigation, it makes this so much easier.

As a heads up, the interface for those provider-specific options is likely to change in the next major version!

rwnx commented 3 weeks ago

Hey, this work went into v2.4.0 today!