Impala: straight SELECT and SELECT via CTE produce different results

galadash commented 1 year ago

Environment

Python: Python 3.10.8 (tags/v3.10.8:aaaf517, Oct 11 2022, 16:50:30) [MSC v.1933 64 bit (AMD64)]
pyodbc: 4.0.38
OS: Windows 10 Enterprise 20H2 64-bit
DB: Impala
driver: Cloudera ODBC Driver for Impala 2.06.16.1022

Issue

Simply put, IS NOT NULL within a CASE WHEN statement do not generate the expected behaviour. Review this simple MWE:

import pyodbc

conn = pyodbc.connect('DSN=Hadoop LDAP', autocommit=True)
cur = conn.cursor()

cur.execute("""CREATE TABLE sandbox.jk_test (col1 int, col2 int)""")

cur.execute("""INSERT INTO sandbox.jk_test (col1, col2)
               VALUES
               (1,1),
               (2,NULL)""")

query1 = """SELECT
    col1, col2 as col2_orig, 
    (CASE WHEN col2 IS NOT NULL THEN col2
        ELSE 0 END) AS col2_1,
    (CASE WHEN col2 IS NULL THEN 0
        ELSE col2 END) AS col2_2
    FROM  sandbox.jk_test"""

query2 = f"""WITH wrapped_table AS ({query1}) SELECT * FROM wrapped_table"""

rows = cur.execute(query1)
col_names = [t[0] for t in rows.description]
col_lengths = [len(c) for c in col_names]
print(col_names)
for row in rows:
    print([str(i).ljust(l) for i, l in zip(row, col_lengths)])

print()
rows = cur.execute(query2)
col_names = [t[0] for t in rows.description]
col_lengths = [len(c) for c in col_names]
print(col_names)
for row in rows:
    print([str(i).ljust(l) for i, l in zip(row, col_lengths)])

cur.execute("""DROP TABLE sandbox.jk_test""")

Which results in the following output. I would expect both outputs to be the same, but it clearly is not.

['col1', 'col2_orig', 'col2_1', 'col2_2']
['1   ', '1        ', '1     ', '1     ']
['2   ', 'None     ', 'None  ', '0     ']

['col1', 'col2_orig', 'col2_1', 'col2_2']
['1   ', '1        ', '1     ', '1     ']
['2   ', 'None     ', '0     ', '0     ']

Things I have tried:

execute query using sqlalchemy -> everything as expected
execute query using db manager (dbeaver + Apache Hive JBDC connection) -> everything as expected
update pyodbc to latest version (error popped up when using 4.0.35) -> nothing changed

It's very strange behaviour, and using such construction in WHERE-filters works fine. Also if using the clause to actually do computations, it miraculously works.

E.g. executing following snippet:

SELECT
    col1, col2 as col2_orig, 
    (CASE WHEN col2 IS NOT NULL THEN col2
        ELSE 0 END) AS col2_1,
    (CASE WHEN col2 IS NULL THEN 0
        ELSE col2 END) AS col2_2,
    (CASE WHEN col2 IS NOT NULL THEN col1*1
        ELSE col1*2 END) AS col2_c
FROM  gda_sandbox.jk_test

generates:

['col1', 'col2_orig', 'col2_1', 'col2_2', 'col2_c']
['1   ', '1        ', '1     ', '1     ', '1     ']
['2   ', 'None     ', 'None  ', '0     ', '4     ']

In my case I will have to rewrite the SQL-generation process to convert these kind of checks into a COALESCE() statement, which would be fine, but it's strange behaviour, which I think should be solved, and might have unintended consequences for other executions? Unfortunately my C++ knowledge is limited, I have not been able to dig deeper in the code and find the solution myself. Hopefully someone can help me! 😄

mkleehammer commented 1 year ago

Thoughts:

1) There is a serious memory bug in 4.0.38 that is being corrected right now. Please retest with 4.0.39 when it is ready or downgrade to 4.0.35.

2) Have you executed these two SQL statements without pyodbc? This does not look like something pyodbc would affect.

3) If you really do think it is pyodbc, please try executing query2 before query1 to see if the output changes.

I'm betting on (2) right now, though.

gordthompson commented 1 year ago

You can test the ODBC driver outside of pyodbc using the following VBScript. (Run cscript issue_1197.vbs from a command prompt.)

' issue_1197.vbs

Set conn = CreateObject("ADODB.Connection")
conn.Open "DSN=Hadoop LDAP"

Sub ExecuteCommand(sql)
    If Left(sql, 5) = "WITH " Or Left(sql, 7) = "SELECT " Then
        Set rst = CreateObject("ADODB.Recordset")
        rst.Open sql, conn
        While Not rst.EOF
            For i = 0 To rst.Fields.Count - 1
                val = rst(i)
                If IsNull(val) Then
                    val = "None"
                End If
                WScript.StdOut.Write val & vbTab
            Next
            WScript.Echo
            rst.MoveNext
        Wend
        rst.Close
        Set rst = Nothing
    Else
        Set cmd = CreateObject("ADODB.Command")
        cmd.ActiveConnection = conn
        cmd.CommandText = sql
        cmd.Execute
        Set cmd = Nothing
    End If
End Sub

ExecuteCommand "CREATE TABLE sandbox.jk_test (col1 int, col2 int)"

ExecuteCommand("INSERT INTO sandbox.jk_test (col1, col2) " & _
               "VALUES " & _
               "(1,1), " & _
               "(2,NULL)")

query1 = "SELECT " & _
         "col1, col2 as col2_orig, " & _
         "(CASE WHEN col2 IS NOT NULL THEN col2 " & _
         "    ELSE 0 END) AS col2_1, " & _
         "(CASE WHEN col2 IS NULL THEN 0 " & _
         "    ELSE col2 END) AS col2_2 " & _
         "FROM  sandbox.jk_test"

query2 = "WITH wrapped_table AS (" & query1 & ") SELECT * FROM wrapped_table"

ExecuteCommand(query1)
WScript.Echo ""
ExecuteCommand(query2)

ExecuteCommand("DROP TABLE sandbox.jk_test")

conn.Close

galadash commented 1 year ago

Thanks @mkleehammer for your thoughts. I didn't consider it could be something related to the driver. I also wouldn't have had any idea of how to try it, so thanks @gordthompson for your proposal, I will give it a try next workday and come back to you!! Hopefully I can clarify things by next week.

galadash commented 1 year ago

So indeed it seems an issue with the driver, and not with pyodbc:

> cscript issue_1197.vbs
Microsoft (R) Windows Script Host Version 5.812
Copyright (C) Microsoft Corporation. All rights reserved.

1       1       1       1
2       None    None    0

1       1       1       1
2       None    0       0

I have updated to the latest ODBC driver available on Cloudera (2.06.17.1026), but that did not help. Furthermore, I've downloaded a trial version for CDATA's Impala driver, which manifests the expected behaviour (same results for both queries). I'm therefore pretty confident that this issue is with Cloudera's driver.

If anyone has experience with submitting issues to Cloudera, I'm all ears! I expect I'll have to post it on their community first...

Sorry for opening an issue here, but a million thanks for the help!

galadash commented 1 year ago

Just an update from my side in case anyone ever encounters this issue.

Selecting "Use Native Query" in the Advanced settings of the Cloudera ODCB Driver for Impala results in the expected outcome. I expect I am "loosing" some query optimization they do in the background, but as many queries I use are CTE from, it seems they are not affected anyway.

galadash commented 7 months ago

Just another update in case anyone else bumps into this issue: it has been solved in "Cloudera ODBC Driver for Impala" version 2.7.0.

Reference: https://community.cloudera.com/t5/Support-Questions/straight-SELECT-and-SELECT-via-CTE-produce-different-results/m-p/377501

mkleehammer / pyodbc