rwnx / pynonymizer

A universal tool for translating sensitive production database dumps into anonymized copies.
https://pypi.org/project/pynonymizer/
MIT License
102 stars 38 forks source link

Setting (perhaps global) to only anonymise non-empty columns #77

Closed miff2000 closed 6 months ago

miff2000 commented 3 years ago

To keep my anonymised database as true-to-life as the source, I want to only update non-empty columns. To achieve that I currently do this:

  - table_name: customers
    type: update_columns
    columns:
      customer_company:
        type: fake_update
        fake_type: company
        where: customer_company!= ""
      customer_title:
        type: fake_update
        fake_type: prefix
        where: customer_title!= ""
      customer_firstname:
        type: fake_update
        fake_type: first_name
        where: customer_firstname!= ""
      customer_surname:
        type: fake_update
        fake_type: last_name
        where: customer_surname!= ""

While this works, it's a little laborious and makes the strategy file very long.

Is there a way to set "only update if not empty" globally somehow? Or, failing that, at any other level?

rwnx commented 3 years ago

Hi, There's no way to do this currently.

I'm not really on board with a global setting as it only applies to one type of table update, but i feel your pain with the verbosity of the where key (and the expanded syntax).

one solution that has worked for some update types has been introducing a compact syntax, i.e. for the where statements, but i'd like to talk more about what this might look like. e.g.

# proposal for compact pattern matching ${fake_type} WHERE ${where condition}   
      customer_surname: last_name WHERE customer_surname != ""

# equivalent to 
      customer_surname:
        type: fake_update
        fake_type: last_name
        where: customer_surname!= ""

I think this is particularly important because your use case defines non-empty as != '' but it's fairly trivial to imagine other definitions of empty as NULL, for example, so i have trouble advocating for a update non-empty strategy.

I'd be interested to hear your thoughts, though!

miff2000 commented 3 years ago

Thanks for getting back to me so quickly!

I guess the only way a global setting could work is to have an option like update_only_when: != "" with a sane default. That could then be overridden when a where: is used in a strategy. I do see your point about it being only relevant if doing an update though.

Your suggestion sounds like a good compromise though, and does allow a much wider range of logic to be used

rwnx commented 3 years ago

I think we need a bit more time to gauge interest and brainstorm this one into a more concrete feature request.

My concern is mostly around making an existing feature WHERE harder to use/ adding features that do have overlapping features.

We can track any progress here!

miff2000 commented 3 years ago

I think we need a bit more time to gauge interest and brainstorm this one into a more concrete feature request.

Cool. So what do you think we should hash out more?

rwnx commented 3 years ago

So I've been thinking your suggestion for a global/default.

I think we can have the best of both worlds with a default where setting, which keeps the concepts we currently have and builds on it.

If you want to be able to default this, you'd have to be able to parameterize the column name, since it would be different in every column. At the moment, where settings are just appended after the WHERE keyword in the query, so as i see it you'd need to be able to do something like:

default_where: ${column} != ''

table_name:
  columns:
    column1: fake_type # [...] WHERE column1 != '';

This parameterization would be useful within individual WHEREs as well, so I think this could work.

We'll also be setting a precedent for the idea of "defaults" in the strategyfiles, which i'm on board with, but i wanted to ask whether setting the same options was something that comes up for you often?

miff2000 commented 3 years ago

Yeah, I use that where: <column> != "" in most places, where possible. I also often refer to other column names, to keep names, addresses, etc. consistent across a whole table.

I did also wonder whether there might be a use case for Jinja 2 templating. That might enable compound fakers like {{ first_name }}@{{ domain_name }} for instance, and perhaps could serve to parameterize the column name? I use Jinja extensively inside Ansible

miff2000 commented 3 years ago

Did that answer help at all?

rwnx commented 3 years ago

@miff2000 Yeah it did, more context is always useful - I have a limited insight into how people actually use the tool over the supported features, so it's cool to see new ideas and suggestions. Thanks for coming back to me, I'm struggling a bit for time at the moment so my contributions are going to be a little sporadic!

miff2000 commented 3 years ago

I totally understand. I'm struggling to find time too these days, though it doesn't dampen the enthusiasm!

What are your thoughts as to a design then? Would you see Jinja fitting in there too or is that not a direction you'd want to go? My Python is pretty limited but I know someone who can probably get a decent PR in if we know the direction

rwnx commented 3 years ago

The features I'd like to support/ I think would have life in the project (and also solve your issues!) are:

  1. Parameterized WHERE statements (allowing ${column} or similar token)
  2. Global defaults for some settings, starting with fake_update/where

Parameterized/compound fake_updates is something I'd like to support too, although it's a little more involved - allowing template eval would effectively change the way users interact with the faker generators entirely... Maybe we could spin this one out into a new issue?

FWIW, custom formats are also partially supported by custom generators (for #72): there's an example in the mysql integration test.

james-green-affinity commented 2 years ago

Here is a snip of how I am using your library

---
  tables:
    sh1d8_users_field_data:
      columns:
        name:
          type: unique_login
          where: uid > 1

I would love to be able to do something like this

    sh1d8_users_field_data:
      columns:
        name:
          type: unique_login
          where: uid > 1 AND != $[list_of_uids]

My use case for this is to exclude certain fields from getting anonymized.

This is a helpful feature in order for developers to be able to retain their accounts when you migrate from a production to --> integrate --> Test environment.

I of course will make sure all passwords get blanked, so it will force everyone to reset.

rwnx commented 2 years ago

Hi @james-green-affinity, did you mean to post your comment here? it seems like it doesn't relate to this feature request. Can you make a new issue if this is something new, or clarify how it relates to this thread?

In a comment, though: the where field is passed verbatim to your db provider, and I'm pretty sure your use-case can be solved with sql:

    sh1d8_users_field_data:
      columns:
        name:
          type: unique_login
          where: uid > 1 AND uid IN (1, 5, 10, 32432)

more info on the IN operator / array comparisons here:

rwnx commented 6 months ago

Closing as stale and out of scope. Please open a discussion if you want to talk more and gather community support. https://github.com/rwnx/pynonymizer/discussions