shinichi-takii / ddlparse

DDL parase and Convert to BigQuery JSON schema and DDL statements
https://pypi.org/project/ddlparse/
BSD 3-Clause "New" or "Revised" License
87 stars 29 forks source link

Postgres/Redshift : Not parse DDL of "::" syntax in field attribute #11

Closed shinichi-takii closed 4 years ago

shinichi-takii commented 6 years ago

Example

DDL (Redshift)

CREATE TABLE IF NOT EXISTS "sample_schema"."sample_table"
(
    "div_cd" VARCHAR(2) NOT NULL
    ,"div_name" VARCHAR(30) NOT NULL
    ,"org_cd" VARCHAR(8) NOT NULL
    ,"org_name" VARCHAR(60) NOT NULL
    ,"team_cd" VARCHAR(2) NOT NULL
    ,"team_name" VARCHAR(120) NOT NULL
    ,"personal_cd" VARCHAR(7) NOT NULL
    ,"personal_name" VARCHAR(300) NOT NULL
    ,"username" VARCHAR(6) NOT NULL
    ,"staff_flg" CHAR(1)  DEFAULT '0'::bpchar ENCODE lzo
    ,"leader_flg" CHAR(1)  DEFAULT '0'::bpchar ENCODE lzo
)
DISTSTYLE EVEN
;

ddlparse output BigQuery JSON schema

Not parse leader_flg field.

[{
    "name": "div_cd", "type": "STRING", "mode": "REQUIRED"
}, {
    "name": "div_name", "type": "STRING", "mode": "REQUIRED"
}, {
    "name": "org_cd", "type": "STRING", "mode": "REQUIRED"
}, {
    "name": "org_name", "type": "STRING", "mode": "REQUIRED"
}, {
    "name": "team_cd", "type": "STRING", "mode": "REQUIRED"
}, {
    "name": "team_name", "type": "STRING", "mode": "REQUIRED"
}, {
    "name": "personal_cd", "type": "STRING", "mode": "REQUIRED"
}, {
    "name": "personal_name", "type": "STRING", "mode": "REQUIRED"
}, {
    "name": "username", "type": "STRING", "mode": "REQUIRED"
}, {
    "name": "staff_flg", "type": "STRING", "mode": "NULLABLE"
}]
BhuviTheDataGuy commented 4 years ago

Im getting a different error while parsing the about SQL query of RedShift. Seems its considering the Encoding part as data type.

* TABLE *
schema = sample_schema : name = sample_table : is_temp = False
* BigQuery Fields *
Traceback (most recent call last):
  File "c:/Users/rbhuv/Desktop/code/testing.py", line 35, in <module>
    print(table.to_bigquery_fields())
  File "C:\Users\rbhuv\AppData\Roaming\Python\Python38\site-packages\ddlparse\ddlparse.py", line 427, in to_bigquery_fields
    return self._columns.to_bigquery_fields(name_case)
  File "C:\Users\rbhuv\AppData\Roaming\Python\Python38\site-packages\ddlparse\ddlparse.py", line 365, in to_bigquery_fields
    bq_fields.append(col.to_bigquery_field(name_case))
  File "C:\Users\rbhuv\AppData\Roaming\Python\Python38\site-packages\ddlparse\ddlparse.py", line 292, in to_bigquery_field
    type = self.bigquery_legacy_data_type
  File "C:\Users\rbhuv\AppData\Roaming\Python\Python38\site-packages\ddlparse\ddlparse.py", line 256, in bigquery_legacy_data_type        
    return self.bigquery_data_type
  File "C:\Users\rbhuv\AppData\Roaming\Python\Python38\site-packages\ddlparse\ddlparse.py", line 250, in bigquery_data_type
    raise ValueError("Unknown data type : '{}'".format(self._data_type))
ValueError: Unknown data type : 'LZO'

Reproduce this issue:

from ddlparse import DdlParse

sample_ddl = """
CREATE TABLE IF NOT EXISTS "sample_schema"."sample_table"
(
    "div_cd" VARCHAR(2) NOT NULL
    ,"div_name" VARCHAR(30) NOT NULL
    ,"org_cd" VARCHAR(8) NOT NULL
    ,"org_name" VARCHAR(60) NOT NULL
    ,"team_cd" VARCHAR(2) NOT NULL
    ,"team_name" VARCHAR(120) NOT NULL
    ,"personal_cd" VARCHAR(7) NOT NULL
    ,"personal_name" VARCHAR(300) NOT NULL
    ,"username" VARCHAR(6) NOT NULL
    ,"staff_flg" CHAR(1)  DEFAULT '0'::bpchar ENCODE lzo
    ,"leader_flg" CHAR(1)  DEFAULT '0'::bpchar ENCODE lzo
)
DISTSTYLE EVEN
;

"""

# parse pattern (1)
table = DdlParse().parse(sample_ddl)

# parse pattern (2)
parser = DdlParse()
parser.ddl = sample_ddl
table = parser.parse()

print("* TABLE *")
print("schema = {} : name = {} : is_temp = {}".format(table.schema, table.name, table.is_temp))

print("* BigQuery Fields *")
print(table.to_bigquery_fields())

print("* BigQuery Fields - column name to lower case / upper case *")
print(table.to_bigquery_fields(DdlParse.NAME_CASE.lower))
print(table.to_bigquery_fields(DdlParse.NAME_CASE.upper))

print("* COLUMN *")
for col in table.columns.values():
    print("name = {} : data_type = {} : length = {} : precision(=length) = {} : scale = {} : constraint = {} : not_null =  {} : PK =  {} : unique =  {} : BQ {}".format(
        col.name,
        col.data_type,
        col.length,
        col.precision,
        col.scale,
        col.constraint,
        col.not_null,
        col.primary_key,
        col.unique,
        col.to_bigquery_field()
        ))

print("* Get Column object (case insensitive) *")
print(table.columns["total"])
shinichi-takii commented 4 years ago

@BhuviTheDataGuy Thank you for reporting.

I fixed this issue. Please upgrade the package to v1.6.1.

$ pip --no-cache-dir install --upgrade ddlparse

Thanks

BhuviTheDataGuy commented 4 years ago

Awesome, it worked.

But the total class is giving an error now.

Traceback (most recent call last):
  File "c:/Users/rbhuv/Desktop/code/testing.py", line 57, in <module>
    print(table.columns["total"])
  File "C:\Users\rbhuv\AppData\Roaming\Python\Python38\site-packages\ddlparse\ddlparse.py", line 337, in __getitem__
    return super().__getitem__(key.lower())
KeyError: 'total'
shinichi-takii commented 4 years ago

Hello, @BhuviTheDataGuy

This statement calls the total column. But, your DDL does not have a total column. That is the cause of the error.

For example, this DDL should work.

CREATE TABLE IF NOT EXISTS "sample_schema"."sample_table"
(
  "div_cd" VARCHAR(2) NOT NULL
  ,"div_name" VARCHAR(30) NOT NULL
  ,"org_cd" VARCHAR(8) NOT NULL
  ,"org_name" VARCHAR(60) NOT NULL
  ,"team_cd" VARCHAR(2) NOT NULL
  ,"team_name" VARCHAR(120) NOT NULL
  ,"personal_cd" VARCHAR(7) NOT NULL
  ,"personal_name" VARCHAR(300) NOT NULL
  ,"username" VARCHAR(6) NOT NULL
  ,"staff_flg" CHAR(1)  DEFAULT '0'::bpchar ENCODE lzo
  ,"leader_flg" CHAR(1)  DEFAULT '0'::bpchar ENCODE lzo
  ,"total" INTEGER
)
DISTSTYLE EVEN
;

Please check it.

Thank you.

EnyMan commented 3 years ago

The fix provided is not sufficient as I have CREATE TABLE with column created_at timestamp without time zone DEFAULT ('now'::text)::timestamp without time zone ENCODE az64 and the parser splits the ::timestamp without time zone as two new columns one without with type time and another zone with type ENCODE