rwnx / pynonymizer

A universal tool for translating sensitive production database dumps into anonymized copies.
https://pypi.org/project/pynonymizer/
MIT License
102 stars 38 forks source link

Not able to support randomizing a jsonb type #91

Closed stoiven closed 2 years ago

stoiven commented 2 years ago

Describe the bug Trying to randomize a specific portion in a json field, but there is no specific expression to declare it as.

To Reproduce Steps to reproduce the behavior: In the .yaml file, have this:

  table:
    columns:
      params: 
        type: literal
        value: CASE WHEN params->>'email' ~ LOWER('website.com[[:>:]]') THEN params->>'email' ELSE CONCAT(DIGEST(LOWER(params->>'email'), 'sha1'), '@websitefaker.com') END

The structure of the table is:

     Column      |            Type             |                           Modifiers
-----------------+-----------------------------+---------------------------------------------------------------
 blob_random     | integer                     | not null default nextval('table_name_id_seq'::regclass)
 random_id       | text                        | not null
 completion_time | timestamp without time zone | default now()
 params          | jsonb                       |

The params field is this whole blob of json fields e.g.:

{
  "blob": 173467,
  "blobby": 20274,
  "service": "random",
  "owners": [
    "chris@googlefake.com"
  ]
}

When trying to anonymize, I get ERROR: column "params" is of type jsonb but expression is of type text

Expected behavior A clear and concise description of what you expected to happen.

The value output should change only the specified params (tested this on the local database itself, so it should work here) Ran a SELECT CASE WHEN params->>'email' ~ LOWER('website.com[[:>:]]') THEN params->>'email' ELSE CONCAT(DIGEST(LOWER(params->>'email'), 'sha1'), '@websitefaker.com') END FROM table_name

So the output should be:

{
  "blob": 173467,
  "blobby": 20274,
  "service": "random",
  "owners": [
    "sdfkj@websitefaker.com"
  ]
}
rwnx commented 2 years ago

Have you tried an update statement using that CASE WHEN [...] statement manually?

For reference, pynonymizer would put a literal into this kind of structure:

UPDATE table
SET params = (CASE WHEN params->>'email' ~ LOWER('website.com[[:>:]]') THEN params->>'email' ELSE CONCAT(DIGEST(LOWER(params->>'email'), 'sha1'), '@websitefaker.com') END)

That doesn't look quite right to me!

Doesn't postgres' ->> operator output text, rather than jsonb, which you'd need for this to work in an update statement? i.e , i think you might need to use -> in the THEN, and you'd also need to return the whole json with the field changed.

https://www.postgresql.org/docs/12/functions-json.html

At the current time, pynonymizer doesn't support json-field manipulations, because it would be a database-specific thing. Happy to discuss that as a feature in a different issue!

stoiven commented 2 years ago

Ah sorry I forgot to update this! But yea, this is totally fine. I found a workaround :)

Used the jsonb_set function to manipulate the data. Thank you for the response and appreciate this tool a lot!