perseas / Pyrseas

Provides utilities for Postgres database schema versioning.
https://perseas.github.io/
BSD 3-Clause "New" or "Revised" License
395 stars 67 forks source link

datacopy CSV with null trailing columns not importing with -u #231

Open jdearing-neudesic opened 3 years ago

jdearing-neudesic commented 3 years ago

I have a csv of data as follows:

1,APPAREL,
2,"BIOTECH, HEALTH CARE & PHARMA",
3,"FOOD, BEVERAGE & AGRICULTURE",

The this column is parentId, and its a self referencing foreign key:

table peergroup:
    columns:
    - id:
        identity: by default
        not_null: true
        type: integer
    - name:
        not_null: true
        type: character varying
    - parentid:
        type: integer
    foreign_keys:
      fk_peergroup_peergroup:
        columns:
        - parentid
        references:
          columns:
          - id
          schema: model
          table: peergroup
    owner: postgres
    primary_key:
      pk_peergroup:
        columns:
        - id
    unique_constraints:
      uq_peergroup_name:
        columns:
        - name

If I run yamltodb without -u and copy the generated commands into psql \copy works fine, If I do it with \u I get

Traceback (most recent call last):
  File "/home/jdearing/.local/bin/yamltodb", line 8, in <module>
    sys.exit(main())
  File "/home/jdearing/.local/lib/python3.8/site-packages/pyrseas/yamltodb.py", line 70, in main
    db.dbconn.copy_from(stmt[3], stmt[1])
  File "/home/jdearing/.local/lib/python3.8/site-packages/pgdbconn/dbconn.py", line 157, in copy_from
    curs.copy_from(f, table, sep)
psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type integer: ""
CONTEXT:  COPY peergroup, line 1, column parentid: ""
jmafc commented 3 years ago

It seems odd that psql would accept the input file since the third column values are missing. Have you tried adding a 0 at the end of each line to test? @dvarrazzo Sorry to bother you again, but if SQL copy (I realize it's a Postgres invention) is supposed to accept missing data values (as it apparently does) and supply a default (I'm not sure it's a NULL or a 0 that is being injected), then it seems that psycopg copy_from() would have to abide by this (weird IMHO) rule.

jdearing-neudesic commented 3 years ago

The docks for the COPY STATEMENT which I fully acknowledge is distincy from the \copy instruction of the psql executable claims that the default null is "an unquoted empty string in CSV format". Therefor since the CSV in question has a trailing comma on each line, unlike the other CSVs in my project which don't end in a comma, I can conclude that at least by the standards of the server side COPY statement that CSV has encoded a NULL value for parentId column.

tbussmann commented 3 years ago

To track this down:

Docs on \copy:

The syntax of this command is similar to that of the SQL COPY command. All options other than the data source/destination are as specified for COPY.

Docs on COPY:

COPY table_name [ ( column_name [, ...] ) ]
FROM { 'filename' | PROGRAM 'command' | STDIN }
[ [ WITH ] ( option [, ...] ) ]
[ WHERE condition ]

FORMAT Selects the data format to be read or written: text, csv (Comma Separated Values), or binary. The default is text. DELIMITER Specifies the character that separates columns within each row (line) of the file. The default is a tab character in text format, a comma in CSV format. This must be a single one-byte character. This option is not allowed when using binary format. NULL Specifies the string that represents a null value. The default is \N (backslash-N) in text format, and an unquoted empty string in CSV format. You might prefer an empty string even in text format for cases where you don't want to distinguish nulls from empty strings. This option is not allowed when using binary format.

So everything seems to depend on the FORMAT parameter. Let's see how that is set:

https://github.com/perseas/Pyrseas/blob/a96de397e1d48f9d901c581caffd74dc09165efb/pyrseas/yamltodb.py#L69-L70 here we pass only two parameters (likely path and table name) to pgdbconn's copy_from. There, the separator is set to it's default , and passed to psycopg2's curs_copy_from. This adds a default value to NULL and constructs a COPY SQL command like COPY %s%s FROM stdin WITH DELIMITER AS %s NULL AS %s

To conclude: DELIMITER is fixed set to pgdbconn's default ,, NULL is fixed set to psycopg2's default \N. In the end we are doing a simple text import which is the Postgres default for FORMAT which is not set (and not setable with the given abstraction layers) by perseas/psycopg2. Let's try to find out why:

Remember the syntax of that COPY command constructed? This doesn't look like what we have leant from the docs in the beginning. So read them again to find:

The following syntax was used before PostgreSQL version 9.0 and is still supported:

COPY table_name [ ( column [, ...] ) ]
FROM { 'filename' | STDIN }
[ [ WITH ]
[ BINARY ]
[ OIDS ]
[ DELIMITER [ AS ] 'delimiter' ]
[ NULL [ AS ] 'null string' ]
[ CSV [ HEADER ]
[ QUOTE [ AS ] 'quote' ]
[ ESCAPE [ AS ] 'escape' ]
[ FORCE NOT NULL column [, ...] ] ] ]

Ha! FORMAT was not available back then!

jmafc commented 3 years ago

Thanks for the detailed trace, Tobias. Aside: your postgresql.org links use latest in the path instead of current. Although FORMAT may not have been available in 8.4 or earlier, CSV as a standalone keyword was supported (in fact, \copy even in PG 13 still accepts csv without a preceding format). In any case, going back to the commit a3fc1bc which implemented support for -u in processing \copy statements, it seems I had intended to add csv as an argument but I dropped the ball somewhere. Adding it now, would mean messing with pgdbconn, which IMO is another reason to reintegrate that code back into pyrseas.

tbussmann commented 3 years ago

Aside: your postgresql.org links use latest in the path instead of current.

thanks, fixed. Obviously a lack of QA ;)

Although FORMAT may not have been available in 8.4 or earlier, CSV as a standalone keyword was supported

right. And that does change NULL defaults, just like FORMAT CSV does. But the current defaults are supplied by the wrappers, not PostgreSQL itself.

Adding it now, would mean messing with pgdbconn, which IMO is another reason to reintegrate that code back into pyrseas.

actually - if I understand the code correctly - skipping pgdbconn would still not allow to specify 'CSV' as psycopg does not support this parameter at all. But it would be possible to pass an empty string for the 'null' parameter of it's curs_copy_from to mimic that behaviour.

jmafc commented 3 years ago

Thanks Tobias for the clarification. I see now that psycopg2 copy_from and copy_to do not support the csv parameter in any way, only copy_expert does. @jdearing-neudesic I'm afraid that at this point, the only thing we can do is document that if you're going to use yamltodb -u to load some tables, columns with SQL NULLs will have to be encoded as \N. Implementing a workaound with copy_expert seems to me as a step backward for very little gain, and adding an empty string in our code would only fix your situation but would break anyone using the standard default for NULLs, even those using dbtoyaml to export the data in the first place.

dvarrazzo commented 3 years ago

@dvarrazzo Sorry to bother you again, but if SQL copy (I realize it's a Postgres invention) is supposed to accept missing data values (as it apparently does) and supply a default (I'm not sure it's a NULL or a 0 that is being injected), then it seems that psycopg copy_from() would have to abide by this (weird IMHO) rule.

Hello,

copy_from and copy_to were a bad idea, for the same reason that there aren't cursor.select() or a cursor.insert() methods each one with a million of parameters and a few hundreds missing because they were introduced in a later Postgres release. After acknowledging this design error, the maintainer at the time added copy_expert(), which is the only thing that makes sense, because it accommodates any future Postgres feature and, like execute(), doesn't assume knowledge of the statement to send, but only of its protocol.

So no, there is no intention to support other parameters in copy_from(): the method is dropped from psycopg3, which only supports a cursor.copy() method (which supports not only file-like objects but also record streams and adaptations like normal queries).

The psycopg2.sql module, and its psycopg3.sql counterpart in the next release, allow to generate syntactically valid queries escaping table and field names correctly and allowing the user to use whatever feature is supported by the server they are talking to.