lelit / pglast

PostgreSQL Languages AST and statements prettifier: master branch covers PG10, v2 branch covers PG12, v3 covers PG13, v4 covers PG14, v5 covers PG15, v6 covers PG16
347 stars 41 forks source link

parse_sql vs parse_plpgsql usage #88

Closed remingtonc closed 3 days ago

remingtonc commented 3 years ago

Hello! Nice projection! Trying to determine if this library could be utilized to parse many SQL statements to form an AST of a schema - or at least have the per-statement AST parsed in useful ways to formulate things about the schema. I am uncertain of the difference between the parse_sql and parse_plpgsql functions and how to utilize them correctly. An example...

CREATE OR REPLACE FUNCTION test.test_parse (
    p_time_start timestamptz,
    p_time_end timestamptz,
    p_time_interval interval default NULL
) RETURNS TABLE (
    ts timestamptz,
    arbitrary_return bigint
) AS $$
BEGIN
    -- some comment
    -- some other comment

    IF p_time_interval IS NULL
        THEN p_time_interval := interval_from_start_end(p_time_start, p_time_end);
    END IF;
    RETURN QUERY
    SELECT
        bucket_function(p_time_interval, timestamp) AS ts,
        arbitrary_return
    FROM test.some_table
    WHERE
        start >= p_time_start
        AND end < p_time_end
    GROUP BY 1;
END; $$ LANGUAGE plpgsql SECURITY DEFINER PARALLEL UNSAFE;

Using parse_plpgsql fails:

>>> parse_plpgsql(raw_sql)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/x/.venv/lib/python3.9/site-packages/pglast/__init__.py", line 31, in parse_plpgsql
    return loads(parse_plpgsql_json(statement))
  File "pglast/parser.pyx", line 346, in pglast.parser.parse_plpgsql_json
pglast.parser.ParseError: "p_time_interval" is not a known variable

Using parse_sql here works, but yields a big blob unparsed:

>>> stmt_ast = parse_sql(raw_sql)
>>> stmt_ast[0].stmt.options[0].arg[0].val
'\nBEGIN\n    -- some comment\n    -- some other comment\n\n    IF p_time_interval IS NULL\n        THEN p_time_interval := interval_from_start_end(p_time_start, p_time_end);\n    END IF;\n    RETURN QUERY\n    SELECT\n        bucket_function(p_time_interval, timestamp) AS ts,\n        arbitrary_return\n    FROM test.some_table\n    WHERE\n        start >= p_time_start\n        AND end < p_time_end\n    GROUP BY 1;\nEND; '

Trying to parse that inner blob as plpgsql similarly does not work:

>>> parse_plpgsql(stmt_ast[0].stmt.options[0].arg[0].val)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/x/.venv/lib/python3.9/site-packages/pglast/__init__.py", line 31, in parse_plpgsql
    return loads(parse_plpgsql_json(statement))
  File "pglast/parser.pyx", line 346, in pglast.parser.parse_plpgsql_json
pglast.parser.ParseError: syntax error at or near "IF", at index 58

Any tips?

Further - any heuristic to not need to explicitly declare "this function should be parsed as SQL" versus "this function should be parsed as PL/pgSQL", or do I need to decide that per statement?

lelit commented 3 years ago

Hi, unfortunately the plpgsql parser is not yet fully implemented, so I'm afraid it is of little utility as of now. I'm enjoying a short vacation so I cannot try your example: I will do once I get back.

lelit commented 2 years ago

As the related issue is now closed, I will try to get at this soon.

lelit commented 2 years ago

The parse error is now fixed in just released v3.6, but I'm leaving this open to remind me a possible clarification in the doc about other issues:

VaibhaveS commented 1 week ago

@lelit I am confused here as well, when to use either of the two. Can you please clarify.

lelit commented 1 week ago

I will try to clarify this in the documentation. In the meantime, expanding what I said in a previous comment above, the parse_plpgsql function is severely underpowered because while it properly execute the parsing of the statement as PostgreSQL would, what it returns is a a little more of a raw sequence of tokens, not an AST like parse_sql does.

Consider the following examples: they parse the same

from pprint import pprint
from pglast import parse_plpgsql

STMT = """\
CREATE FUNCTION add (a integer, b integer)
RETURNS integer AS $$
BEGIN 
  RETURN a + b;
END;
$$ LANGUAGE plpgsql
"""

as_plpgsql = parse_sql(STMT)
pprint(as_plpgsql)

This prints out

[{'PLpgSQL_function': {'action': {'PLpgSQL_stmt_block': {'body': [{'PLpgSQL_stmt_return': {'expr': {'PLpgSQL_expr': {'parseMode': 2,
                                                                                                                     'query': 'a '
                                                                                                                              '+ '
                                                                                                                              'b'}},
                                                                                           'lineno': 1}}],
                                                         'lineno': 1}},
                       'datums': [{'PLpgSQL_var': {'datatype': {'PLpgSQL_type': {'typname': 'UNKNOWN'}},
                                                   'refname': 'a'}},
                                  {'PLpgSQL_var': {'datatype': {'PLpgSQL_type': {'typname': 'UNKNOWN'}},
                                                   'refname': 'b'}},
                                  {'PLpgSQL_var': {'datatype': {'PLpgSQL_type': {'typname': 'UNKNOWN'}},
                                                   'refname': 'found'}}]}}]

that, as you can see, is just a list of plain Python dictionaries.

If you use parse_sql instead, you obtain a richer representation of the statement:

from pprint import pprint
from pglast import parse_sql

STMT = """\
CREATE FUNCTION add (a integer, b integer)
RETURNS integer AS $$
BEGIN 
  RETURN a + b;
END;
$$ LANGUAGE plpgsql
"""

as_sql = parse_sql(STMT)
pprint([stmt(skip_none=True) for stmt in as_sql])

This emits

[{'@': 'RawStmt',
  'stmt': {'@': 'CreateFunctionStmt',
           'funcname': ({'@': 'String', 'sval': 'add'},),
           'is_procedure': False,
           'options': ({'@': 'DefElem',
                        'arg': ({'@': 'String',
                                 'sval': '\nBEGIN \n  RETURN a + b;\nEND;\n'},),
                        'defaction': {'#': 'DefElemAction',
                                      'name': 'DEFELEM_UNSPEC',
                                      'value': 0},
                        'defname': 'as',
                        'location': 59},
                       {'@': 'DefElem',
                        'arg': {'@': 'String', 'sval': 'plpgsql'},
                        'defaction': {'#': 'DefElemAction',
                                      'name': 'DEFELEM_UNSPEC',
                                      'value': 0},
                        'defname': 'language',
                        'location': 96}),
           'parameters': ({'@': 'FunctionParameter',
                           'argType': {'@': 'TypeName',
                                       'location': 23,
                                       'names': ({'@': 'String',
                                                  'sval': 'pg_catalog'},
                                                 {'@': 'String',
                                                  'sval': 'int4'}),
                                       'pct_type': False,
                                       'setof': False,
                                       'typemod': -1},
                           'mode': {'#': 'FunctionParameterMode',
                                    'name': 'FUNC_PARAM_DEFAULT',
                                    'value': 'd'},
                           'name': 'a'},
                          {'@': 'FunctionParameter',
                           'argType': {'@': 'TypeName',
                                       'location': 34,
                                       'names': ({'@': 'String',
                                                  'sval': 'pg_catalog'},
                                                 {'@': 'String',
                                                  'sval': 'int4'}),
                                       'pct_type': False,
                                       'setof': False,
                                       'typemod': -1},
                           'mode': {'#': 'FunctionParameterMode',
                                    'name': 'FUNC_PARAM_DEFAULT',
                                    'value': 'd'},
                           'name': 'b'}),
           'replace': False,
           'returnType': {'@': 'TypeName',
                          'location': 51,
                          'names': ({'@': 'String', 'sval': 'pg_catalog'},
                                    {'@': 'String', 'sval': 'int4'}),
                          'pct_type': False,
                          'setof': False,
                          'typemod': -1}},
  'stmt_len': 0,
  'stmt_location': 0}]

The parse_plpgsql function is meant to parse more complex procedure language statements, containing PG extensions to the SQL language like loops, conditions and the like. But for now, it is of little utility, until someone implement a proper AST for it, either at the lower libpg_query level, or in pglast.

lelit commented 3 days ago

This should be fixed by referenced commit, in upcoming v6.5.