psycopg / psycopg2

PostgreSQL database adapter for the Python programming language
https://www.psycopg.org/
Other
3.35k stars 506 forks source link

copy_expert doesn't support NULL character escaping, Null or \N automatically becomes a string #1711

Closed chrislungu closed 3 months ago

chrislungu commented 3 months ago

Please complete the following information:

Describe the bug

I am working on a ELT pipeline that uses psycopg2 as the main (and only) connection between python and my database (postgres). I am not using sqlalchemy or panda's native to_sql() as they're not fit for purpose. My data sometimes has missing values (SQL [null] equivalent), which I obviously need reflected in my staging table. From that angle, it's the most simple thing I want to do: have data, have the same data in db.

When copying the data into my staging table (unlogged, all data types text), postgres shows the Null values as a string ("Null") as opposed to the "empty" SQL standard [null]. I tried everything, including converting NaN to None in pandas and converting Nan to \N in my iterator, but whatever I do, psycopg2 assumes everything is a string. I can't input any other data types, as it will only accept strings.

copy_from() might work (haven't tried) because it allows you to specify the [null] character (null="\\N"), but this option is not available in copy_expert(). using copy_from() is also not a possibility, because it won't accept schema's and table names in one go.

how do I go around this issue?

thanks!

If possible, provide a script reproducing the issue.

import re
import pandas as pd

df = pd.read_csv("../src/data.csv", dtype="object")

# even tried this
# data.replace(np.NaN, None, inplace = True)
# data.replace(pd.NaT, None, inplace = True)

df.columns = df.columns.map(lambda column: re.sub(r"[^a-zA-Z0-9_ ]", "", column).replace(" ", "_").strip().lower())

def iterator():
    for index, row in df.iterrows():
        yield "|".join(["\\N" if pd.isna(value) else str(value) for value in row]) + "\n" OR
        yield "|".join(["NULL" if pd.isna(value) else str(value) for value in row]) + "\n"

data = IterateData(interator())

query = sql.SQL(
    """ copy {schema}.{table} from stdin with csv delimiter as '|'; """
).format(
        schema=sql.Identifier("schema"),
        table=sql.Identifier("table")
    )

    cur.copy_expert(sql=query, file=data, size=8192)
    conn.commit()
chrislungu commented 3 months ago

ok, I wasn't thinking clearly :). this works:

copy {schema}.{table} from stdin with csv delimiter as '|' null as '\\N';

dvarrazzo commented 3 months ago

Cool, no problem.

Please take a look at psycopg 3 copy, because you can pass it Python values instead of formatting your string manually, so you shouldn't have this problem.

https://www.psycopg.org/psycopg3/docs/basic/copy.html#copy