trinodb / trino-python-client

Python client for Trino
Apache License 2.0
309 stars 151 forks source link

Support trailing semicolon in sql statements #277

Closed bendemott closed 1 year ago

bendemott commented 1 year ago

Naive implementation of removal of optional trailing semicolon. This allows sql statements passed to cursor.execute() to contain an optional trailing semicolon.

Description

Pandas and several other tools that generate SQL statements and use this driver in a generic way will append semicolon's to the end of their statements. Optional SQL statements are allowed at the end of many DBAPI drivers for other databases.

Example:

In Pandas the following statement will fail with Trino because it adds a trailing semicolon:

df.to_sql('users', con=engine)

To work-around this, the following code is required:

import pandas as pd
from sqlalchemy import create_engine, insert
from sqlalchemy.engine import Connection
from typing import Iterator

def fixsql(pd_table, conn: Connection, keys: list, data_iter: Iterator):
    """
    Custom function to hack around issue with Pandas adding trailing semi-colon.
    If the statement that Pandas generates contains a trailing semicolon, remove
    it before actually executing the query.
    """
    data = [dict(zip(keys, row)) for row in data_iter]
    executable = insert(pd_table.table).values(data) # sqlalchemy.sql.dml.Insert
    statement = str(executable.compile(dialect=conn.dialect, compile_kwargs={"literal_binds": True}))
    print(f'SQL STATEMENT IS: {statement}')

    # remove the trailing semicolon if required.
    if statement.strip().endswith(';'):
        statement = statement.rstrip(';', 1)

    # conn.execute can take a string or a `sqlalchemy.sql.expression.Executable`
    result = conn.execute(statement)
    return result.rowcount

engine = create_engine('trino://user:pwd@example:8080/hive/default', echo=False)
df = pd.DataFrame({'name' : ['User 1', 'User 2', 'User 3']})
# Pass our "method=fixsql" which will fix the sql before it's submitted to (Trino)
df.to_sql('users', con=engine, method=fixsql)
with engine.connect() as conn:
    df2 = pd.read_sql('select * from users', conn)
    print(df2.head())

Non-technical explanation

Allows sql statements to contain an optional trailing semicolon.

Release notes

( ) This is not user-visible or docs only and no release notes are required. (x) Release notes are required, please propose a release note for me. ( ) Release notes are required, with the following suggested text:

bendemott commented 1 year ago

Allow users to pass a method/single-method class which can do whatever pre-processing they want on queries (via the Connection constructor). The benefit is then people are free to do what they want and they can also opt into this behaviour instead of this being default behaviour.

Hey @hashhar

The semicolon character is a statement terminator. It is a part of the ANSI SQL-92 standard

I don't think adding a custom hook makes sense, this is part of the sql standard, and any database that claims to implement that standard should support statements terminated with a semicolon.

The custom hook solution doesn't work (easily) for sqlalchemy where you initialize the connection via URI. You're also introducing additional custom configuration for something that should be standard behavior.

Imagine having 1,000 users of jupyter notebooks using Trino and having to train each one to write a custom lambda that properly removes trailing semicolons.

If you don't feel this belongs in the driver I can understand that - Should this be implemented at the engine level - Happy to open a PR there?

From the SQL 92 standard:

          X3H2-92-154/DBL CBR-002
         20.1 <direct SQL statement>

        <direct SQL statement> ::=
              <directly executable statement> <semicolon>
hashhar commented 1 year ago

Thanks for linking to spec - that provides strong case to support it on the server. The other benefit being that we won't silently start swallowing any additional queries present after the semicolon.

I don't believe this should live in the driver. But if it had to I'd prefer something more generic than just for this one use-case to avoid growing multiple such special handlings over the driver.

cc: @martint

bendemott commented 1 year ago

The other benefit being that we won't silently start swallowing any additional queries present after the semicolon.

This implementation is not at risk of that, it will only trim the semicolon when its at the end of the line, and followed by whitespace. The whitespace is only stripped at the RIGHT side of the string.

There is no "clever" logic in the code attempting to perform string parsing, it simply is... Does the statement end with a semicolon and optional whitespace - ok then, remove the semicolon at the very end of the line.

The current state of things is the engine throws up errors when a semicolon is added to the end of a statement. In the new state, even if someone added multiple statements separated by semicolons, the engine would still throw up an error as multiple statements aren't allowed.

I was very careful to ensure it's essentially impossible to "clobber" the query in any way, in addition I even preserve the whitespace at the end of the line, so a user wouldn't be surprised by trailing whitespace being removed from their query.

I'll wait on on comment from Martin.

willmostly commented 1 year ago

We have extended our base lexer to support the semicolon terminator. This lexer is used in trino-cli.

hashhar commented 1 year ago

The options to move forward seem to be:

cc: @martint. PTAL and share your opinion about doing this on the server for all clients to benefit.

martint commented 1 year ago

It makes sense to support semicolons as statement terminator on the server side -- per the SQL spec, it's supposed to be that way. Note that this doesn't mean that the server would support multiple statements in a single invocation.

However it is also more work and might require more consideration than doing the change in a client.

@hashhar, what do you envision as "more work"? It's likely a single-line change to the SQL grammar in Trino, plus any parser tests we may want to add. We might need to look at the Trino CLI, though, which currently does special handling of semicolons on the client side.

Needs more tests

  • two queries separated by semicolons
  • a prepared statement for a query including a semicolon

Just wanted to highlight that adding support for semicolons on the server side doesn't mean that it will be possible to run multiple statements in one invocation -- that's a completely separate topic with far-reaching implications.

bendemott commented 1 year ago

I'll be opening a PR in Trino to add support for terminating semicolons in the dialect, closing this PR.

andreapiso commented 1 year ago

What is the status? pandas to_sql fails on both normal connection and sqlalchemy (with different errors)