trinodb / trino-python-client

Python client for Trino
Apache License 2.0
328 stars 163 forks source link

Convert numeric bindings to string for sqlalchemy.sql.String columns #213

Closed mdesmet closed 2 years ago

mdesmet commented 2 years ago

This PR will autoconvert numeric values to String when inserting or updating records through sqlalchemy targeting a String column.

SqlAlchemy allows to do inserts using following syntax:

users = sqla.Table('users',
                           metadata,
                           sqla.Column('id', sqla.Integer),
                           sqla.Column('name', sqla.String),
                           sqla.Column('fullname', sqla.String),
                           schema="test")
metadata.create_all(engine)
ins = users.insert()
conn.execute(ins, {"id": 2, "name": "wendy", "fullname": "Wendy Williams"})

The python types may not match the target type of the table. Some databases will autocast these values, while Trino doesn't as in following code:

import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('trino://user@localhost:8080/memory/default')
connection = engine.connect()

data = {'col1': ['hello', 2.0, 1]}
df = pd.DataFrame(data)

df.to_sql("test_sqlalchemy",    con=connection,    schema="default",    if_exists="replace",    index=False)

This will generate following exception in Trino

io.trino.spi.TrinoException: Insert query has mismatched column types: Table: [varchar], Query: [double]
    at io.trino.sql.analyzer.SemanticExceptions.semanticException(SemanticExceptions.java:48)
    at io.trino.sql.analyzer.SemanticExceptions.semanticException(SemanticExceptions.java:43)
    at io.trino.sql.analyzer.StatementAnalyzer$Visitor.visitInsert(StatementAnalyzer.java:574)
    at io.trino.sql.analyzer.StatementAnalyzer$Visitor.visitInsert(StatementAnalyzer.java:445)
    at io.trino.sql.tree.Insert.accept(Insert.java:68)
    at io.trino.sql.tree.AstVisitor.process(AstVisitor.java:27)
    at io.trino.sql.analyzer.StatementAnalyzer$Visitor.process(StatementAnalyzer.java:462)
    at io.trino.sql.analyzer.StatementAnalyzer.analyze(StatementAnalyzer.java:425)
    at io.trino.sql.analyzer.Analyzer.analyze(Analyzer.java:79)
    at io.trino.sql.analyzer.Analyzer.analyze(Analyzer.java:71)
    at io.trino.execution.SqlQueryExecution.analyze(SqlQueryExecution.java:269)
    at io.trino.execution.SqlQueryExecution.<init>(SqlQueryExecution.java:193)
    at io.trino.execution.SqlQueryExecution$SqlQueryExecutionFactory.createQueryExecution(SqlQueryExecution.java:808)
    at io.trino.dispatcher.LocalDispatchQueryFactory.lambda$createDispatchQuery$0(LocalDispatchQueryFactory.java:135)
    at io.trino.$gen.Trino_386____20220727_080909_2.call(Unknown Source)
    at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
    at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
    at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
hashhar commented 2 years ago

Isn't this basically adding support for implicit casting on the client side? What's the motivation?

While implicit casts are convenient they change behaviour of the system and in user-visible ways so any change has to considered carefully since we won't get multiple passes at defining the behaviour.

cc: @findepi since CASTs and coercions.

findepi commented 2 years ago

My mental model for thinking about this -- based on the JDBC experience -- is that every client type (Java type, Python type) maps to some well defined type in SQL type system. For example a Java String -> varchar, Python str -> varchar. The client should be able to bind values of SQL types not directly expressible in the client type system. For example in JDBC this is achieved with PreparedStatement.setObject(int index, Object value, int sqlTypeCode). So far, the mental model served me well

Now, this all is quite a generic comment about what clients should be doing, based primarily on my JDBC experience. There might be something special about SQLAlchemy. Maybe it lacks some interfaces for typeful binds for SQL types? Or makes it hard for the calling application to provide data in the right types?

hovaesco commented 2 years ago

Or makes it hard for the calling application to provide data in the right types?

This is exactly the case we are facing in the Pandas to_sql function. Changing behaviour of implicit conversion shouldn't be done by default. I would opt for adding a flag which enables implicit conversion in SQLAlchemy dialect. Some functionalities of Trino Great Expectations integration are impacted by it.

mdesmet commented 2 years ago

They could simply fix it like this (with explicit conversion in pandas).

>>> import pandas as pd
>>> data = {'col1': [2.0, 1, 'test']}
>>> df = pd.DataFrame(data)
>>> df
   col1
0   2.0
1     1
2  test
>>> pd.api.types.infer_dtype(df['col1'])
'mixed-integer'
>>> df['col1'] = df['col1'].astype(str)
>>> df
   col1
0   2.0
1     1
2  test
>>> pd.api.types.infer_dtype(df['col1'])
'string'

However I tested the same with postgres. Postgres inserts and updates automatically convert from numeric to string without explicit cast. Note that this implicit cast does not work in the WHERE clause.

CREATE TABLE T (a VARCHAR);
INSERT INTO T (a) VALUES (1);
UPDATE T SET a=2 WHERE a='1';
select * from T;
findepi commented 2 years ago

@mdesmet the described behavior difference between Trino and PostgreSQL is engines' difference

Trino

trino:default> CREATE TABLE t(c varchar(20));
            -> INSERT INTO t VALUES (42);
CREATE TABLE
Query 20220809_142246_00011_sftar failed: Insert query has mismatched column types: Table: [varchar(20)], Query: [integer]

PostgreSQL

test=# CREATE TABLE t(c varchar(20));
CREATE TABLE
test=# INSERT INTO t VALUES (42);
INSERT 0 1
mdesmet commented 2 years ago

So in this case postgres is actually diverting from the SQL spec as a number is not assignable to a string.

I will close this PR.