rwnx / pynonymizer

A universal tool for translating sensitive production database dumps into anonymized copies.

https://pypi.org/project/pynonymizer/

MIT License

105 stars 38 forks source link

feat: allow definition of sql type #27

Closed babariviere closed 3 years ago

babariviere commented 3 years ago

New feature

Add a new field to the following column strategies:

literal
fake_update

It allows user to define it's own SQL type. This is to override a limitation of types.

Use case

With the following database schema in PostgreSQL:

CREATE TABLE foo (
    id uuid PRIMARY KEY,
    sensitive_data uuid
);

If I write the following strategy file:

tables:
  foo:
    columns:
      sensitive_data: uuid4

We got the following error:

ERROR:  column "sensitive_data" is of type uuid but expression is of type character varying

Solution

To overcome this issue, we could fix the strategy file by writing this:

tables:
  foo:
    columns:
      sensitive_data:
        type: fake_update
        fake_type: uuid4
        sql_type: uuid # <--- new field

In the seed table, we will have the correct data type to update foo table.

babariviere commented 3 years ago

I have a couple of things in the implementation to discuss:

I want to be sure of the use-case. Can you think of any other cases(like this one) where you'd want a value from faker that comes out as a string, but you store it differently in a database? UUID is an easy one, but for others i'm coming up a bit short. I'd like to put these in the documentation so everyone understands what the new feature is all about!

I am not really sure for the other use cases. I haven't tested it but maybe these types can collide:

json and jsonb
geometric data types
postgis data types

Also, as the type is chosen by pynonymizer, the user is not able to create BIGINT for example.

But that depends on the use cases and I am not able to find others apart from mine. 😅

Maybe we could modify the _get_column_subquery to use sql_type to cast the text field to something else?

Something along the lines of:
    elif column_strategy.strategy_type == UpdateColumnStrategyTypes.FAKE_UPDATE:
        return f"( SELECT {column_strategy.qualifier}::{column_strategy.sql_type} FROM {seed_table_name} WHERE \"updatetarget\"=\"updatetarget\" ORDER BY RANDOM() LIMIT 1)"

Yeah sure, I will update my PR then 😄

rwnx commented 3 years ago

Also, as the type is chosen by pynonymizer, the user is not able to create BIGINT for example.

This is legit. What i've tried to do in the seed table is pick types that meet or exceed what faker/python gives us, and then worry about the update statement - I think in the case of bigint we could just make that the default and then have it cast "down".

In the case of complex types like geometric/json, whos equivalent would be something like a class or a dict... I'm not sure, in short. I don't want to add a big serialization concern to the anonymization part just to support them, when the literal type and this feature can get you most of the way there.

I've added on: pull_request to the workflow on master, btw, I'm not sure if that will take effect on next commit or what. I really need to see the tests pinging off since there have been a few things here . You might need to merge/rebase from upstream.

rwnx commented 3 years ago

hey @babariviere, I think there might have been a misunderstanding regarding what i was asking for in the last review. I think that testing for the casting behaviour working is a good idea(!) I don't want to revert the tests you've added - I think that the fixtures being used specifically should have get_data_type return FakeDataType.STRING instead of the string 'UUID'.

Other than that It's looking great. I can't wait to merge it! ✨

babariviere commented 3 years ago

@jerometwell Oh sorry ! I will put it back then ! 😄