shinichi-takii / ddlparse

DDL parase and Convert to BigQuery JSON schema and DDL statements
https://pypi.org/project/ddlparse/
BSD 3-Clause "New" or "Revised" License
87 stars 29 forks source link

Oracle `NUMBER(*,0)` format causes column parsing to stop #45

Closed tb0n3zz closed 4 years ago

tb0n3zz commented 4 years ago

Very excited to find this -- just what I need, so thank you very much for creating it! I created a test to directly call Oracle 12.2.0.1 and run dbms_metadata.get_ddl, which is the standard method for producing DDL from the Oracle database using the CLI. It produces CREATE TABLE statements like this, which I then ran through ddlparse to convert to BigQuery -- but I found that most of the columns were missing in the resulting sql. Here's a simple runnable example where I removed the (*,0) from the first NUMBER column (COL2_WORKS):

from ddlparse.ddlparse import DdlParse

def main():

    oracle_ct = """
  CREATE TABLE "MYSCHEMA"."MYTABLE"
   (    "COL1" VARCHAR2(30) NOT NULL ENABLE,
    "COL2_WORKS" NUMBER NOT NULL ENABLE,
    "COL3" VARCHAR2(30) NOT NULL ENABLE,
    "COL4_BROKEN_AFTER_THIS" NUMBER(*,0) NOT NULL ENABLE,
    "COL5" VARCHAR2(8) NOT NULL ENABLE,
    "COL6" VARCHAR2(3) NOT NULL ENABLE
   ) ;
"""

    table = DdlParse().parse(ddl=oracle_ct, source_database=DdlParse.DATABASE.oracle)
    print(table.to_bigquery_ddl())

if __name__ == "__main__":
    main()

Note the NUMBER(*,0 designation. I found in testing in on many tables that as soon as ddlparse hits any NUMBER(*,0) data type column, it stops processing any more columns, so I get BigQuery DDL like this:

#standardSQL
CREATE TABLE `project.MYSCHEMA.MYTABLE`
(
  COL1 STRING NOT NULL,
  COL2_WORKS FLOAT64 NOT NULL,
  COL3 STRING NOT NULL,
  COL4_BROKEN_AFTER_THIS FLOAT64
)

I stepped through the code/breakpoints enough to understand that the issue is happening in the original column parsing, not in the conversion to BQ DDL. The BQ conversion is only called for the 4 columns.

I suspect the * is breaking a regex in ddlparse or pyparse. I'm out of time to figure out where, and I'd imagine you will be a lot faster at it since you are familiar with how you're using them, but I wanted to share this much since I got this far. For now, easily fixed by just preprocessing my Oracle CREATE TABLE strings to remove anything with the *.

Additional note: it is turning NUMBER(*,0) into a FLOAT64 when NUMBER(*,0) is an integer - but perhaps that's because of the broken regex as well / it's being cut off.

Thanks again for the code!

tb0n3zz commented 4 years ago

Figured I'd share the other thing I ran into and workaround -- preprocessing input DDL from Oracle as follows. Replacing with (38,0) makes (*,0) correctly convert to an INT64.

    # remove (*,0) from NUMBER defs that are integers anyway (38 is the max in variable data type, gets replaced anyway)
    massaged_ddl = re.sub(r'NUMBER\(\*,0\)', r'NUMBER(38,0)', oraddl)
    # chokes on this, tries to process like a column
    massaged_ddl = re.sub(r'SUPPLEMENTAL LOG DATA \(.*\) COLUMNS.*\n', '', massaged_ddl)
shinichi-takii commented 4 years ago

@tb0n3zz Thank you for using, and a issue report. I fixed issue to v1.4.0 .