rwnx / pynonymizer

A universal tool for translating sensitive production database dumps into anonymized copies.
https://pypi.org/project/pynonymizer/
MIT License
101 stars 38 forks

Python: int() argument must be a string, a bytes-like object or a real number, not 'NoneType' #129

Closed jervalles closed 11 months ago

jervalles commented 1 year ago

Describe the bug: When I run a Python script, I get this error: Python: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

This is related to "seed_rows"

I tried setting seed_rows directly in the package to 150 or "150" and it didn't work.

Here is my code:

import pynonymizer
import os 
import dj_database_url

db_config = dj_database_url.config(default=os.environ['DATABASE_URL'])

current_working_directory = os.getcwd()

conf_file_path = os.path.join(current_working_directory, "scripts", "conf.yaml")
input = os.path.join(current_working_directory, "scripts", "dump.sql")
outputfile = aa = os.path.join(current_working_directory, "scripts", "test.sql")

print(input)

pynonymizer.run(
    db_type="postgres",
    input_path=input,
    dj_database_url=os.environ['DATABASE_URL'],
    output_path=outputfile,
    strategyfile_path=conf_file_path,
    db_name="postgres",
    db_user="postgres",
    db_password="postgres",
    db_host=db_config['HOST'],
    db_port=db_config['PORT']
)

When I run it directly from the command line, it works, but I want to use Python because I couldn't figure out how to use Providers:

pynonymizer -i dump.sql -s conf.yaml -o test.sql --strategy conf.yaml -t postgres -u postgres -p postgres

rwnx commented 1 year ago

Hi, in your example it doesn't look like you're setting seed_rows.

If your error stack trace mentions this line, it's because seed_rows is required when you use the Python interface (pynonymizer.run).

I think this error message is super confusing and not at all helpful, though!
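
To illustrate where that message comes from, here is a minimal sketch. It assumes the library ends up calling int() on a seed_rows value that was never set; coerce_seed_rows is a hypothetical helper written for this example, not part of pynonymizer.

```python
def coerce_seed_rows(seed_rows):
    """Hypothetical helper: convert seed_rows to int with a clearer failure."""
    if seed_rows is None:
        # Assumption: an explicit check like this would be easier to diagnose
        # than letting int(None) raise on its own.
        raise ValueError("seed_rows is required, e.g. seed_rows=150")
    return int(seed_rows)


# The confusing failure from the report: int() cannot convert None.
try:
    int(None)
except TypeError as error:
    print(f"TypeError: {error}")

# Both an int and a numeric string are accepted once a value is present.
print(coerce_seed_rows(150))
print(coerce_seed_rows("150"))
```

In other words, passing seed_rows=150 to pynonymizer.run should avoid this particular error.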

rwnx commented 1 year ago

If this doesn't solve your problem, can you paste the full stacktrace/error so we can take a closer look together?

jervalles commented 1 year ago

@rwnx thanks for your fast answer.

I added seed_rows to the args and tried "150" and also 150, but got this error:

expected str, bytes or os.PathLike object, not int

It is raised at: self.pid = _posixsubprocess.fork_exec

image

Here is my code:

pynonymizer.run(
    db_type="postgres",
    input_path=input,
    dj_database_url=os.environ['DATABASE_URL'],
    output_path=outputfile,
    strategyfile_path=conf_file_path,
    db_name="postgres",
    db_user="postgres",
    db_password="postgres",
    db_host=db_config['HOST'],
    db_port=db_config['PORT'],
    seed_rows=150
)

rwnx commented 1 year ago

Can I ask, what makes you think this is related to seed_rows?

As I see it in the stack trace, this is failing in the create_database function.

I think it's related to the arguments passed to subprocess.check_output, so one of these arguments is coming through as an int: db_host, db_port, or db_user (most likely db_port).

Can you try db_port=str(db_config['PORT']) and see if it's that?
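
A rough sketch of the suspected cause, using only the standard library: subprocess rejects non-string arguments, and a port parsed from a URL is an int. Here urllib.parse stands in for dj_database_url, the URL is made up, and connection_args and echo_argument are hypothetical helpers for illustration only.

```python
import subprocess
import sys
from urllib.parse import urlsplit


def connection_args(database_url):
    """Hypothetical helper: split a database URL into string-only arguments."""
    parts = urlsplit(database_url)
    return {
        "db_host": parts.hostname,
        # parts.port is an int (or None), so convert it before passing it on.
        "db_port": str(parts.port) if parts.port is not None else None,
        "db_user": parts.username,
    }


def echo_argument(value):
    """Pass one argument to a child Python process and return what it received."""
    command = [sys.executable, "-c", "import sys; print(sys.argv[1])", value]
    return subprocess.check_output(command, text=True).strip()


# Made-up URL for the example.
args = connection_args("postgres://postgres:postgres@localhost:5432/postgres")

try:
    echo_argument(5432)  # an int in the argument list is rejected
except TypeError as error:
    print(f"TypeError: {error}")

print(echo_argument(args["db_port"]))  # the string form is accepted
```

If the int port is the culprit, wrapping it in str() as suggested above should be enough.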

jervalles commented 1 year ago

It worked with:

pynonymizer.run(
    db_type="postgres",
    input_path=input,
    dj_database_url=str(os.environ['DATABASE_URL']),
    output_path=outputfile,
    strategyfile_path=conf_file_path,
    db_user="postgres",
    db_password="postgres",
    db_host=str(db_config['HOST']),
    db_port=str(db_config['PORT']),
    seed_rows=150
)

Thanks!

Can I push my luck and ask how we are supposed to use Providers here? For example, I want a unique random string, but random_int only has 4 digits and is not unique. I tried to understand the documentation about these Providers, but ....

(I don't even know if a provider like that exists)

rwnx commented 1 year ago

Providers are not good for random things, as they generate a fixed number of seed_rows (not unique, same as the other fake types)

If you need uniqueness/randomness I'd recommend using unique_login or a literal that's suitable for your database:

column_name: ( md5(random()::text) )

Further reading on Postgres random strings: https://stackoverflow.com/questions/3970795/how-do-you-create-a-random-string-thats-suitable-for-a-session-id-in-postgresql
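
For a feel of what that literal produces, here is a rough Python analogue. This is only an illustration: the real value is computed by Postgres per row, and random_md5_text is a hypothetical name chosen for this sketch.

```python
import hashlib
import random


def random_md5_text():
    """Rough analogue of md5(random()::text): hash the text of a random float."""
    random_text = str(random.random())
    return hashlib.md5(random_text.encode("utf-8")).hexdigest()


value = random_md5_text()
print(value)       # a 32-character lowercase hex string, different on each run
print(len(value))  # 32
```

Note that this gives random values rather than guaranteed-unique ones; collisions are very unlikely but not impossible.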

rwnx commented 1 year ago

Just to be completely clear about what providers are for: they're for when you want to add more fake_types because you're unhappy with the ones that faker provides, e.g. you want a specific format. This may not be clear from the documentation. If you have any suggestions on how to make this clearer, I'd love to hear them :)

jervalles commented 1 year ago

I didn't know we could use this:

> Providers are not good for random things, as they generate a fixed number of seed_rows (not unique, same as the other fake types)
>
> If you need uniqueness/randomness I'd recommend using unique_login or a literal that's suitable for your database:
>
> column_name: ( md5(random()::text) )
>
> further reading on postgres random strings https://stackoverflow.com/questions/3970795/how-do-you-create-a-random-string-thats-suitable-for-a-session-id-in-postgresql

Thanks for your answer! I could make it work.

I have nothing to say about the documentation itself. I just didn't understand how to implement them.

rwnx commented 11 months ago

Closing this as I think we resolved it! Open a new issue if you have something else :)