trinodb / trino-python-client

Python client for Trino
Apache License 2.0
307 stars 151 forks source link

Special characters not rendering properly in the latest version 0.322.0 #355

Closed Raghav-254 closed 1 year ago

Raghav-254 commented 1 year ago

Expected behavior

The output should handle the special characters properly like below:

Screenshot 2023-04-10 at 3 49 33 PM

Actual behavior

I was querying one table, and the output doesn't seem to be correct.

Screenshot 2023-04-10 at 3 45 43 PM

Basically the special characters are not being handled properly. This is specific to varbinary datatype.

Steps To Reproduce

This issue was encountered when I upgraded the trino-python-client from v0.306.0 to v0.322.0

You can reproduce this issue by running the below python script:

import os
import trino
import pandas as pd
import getpass

class TrinoSession:

    def __init__(self):
        self.conn = None
        self.cur = None

    def connect(self, server = 'faro', vip = None):
        from dotenv import load_dotenv
        load_dotenv()

        if server == 'holdem':
            host = 'enter host name, to be filled..'
        elif server == 'faro': # faro
            host = 'enter host name, to be filled..'
        else:
            raise ValueError("invalid server")

        if vip == None:
            vip = getpass.getpass('vip')

        conn=trino.dbapi.connect(
            host=host,
            port=8443,
            user=os.environ.get('secretUser'),
            catalog='hive',
            http_scheme='https',
            auth=trino.auth.BasicAuthentication('Enter the authentication details, to be filled..'))

        self.conn =  conn
        self.cur = conn.cursor()

    def close(self):
        self.cur.close()

    def query(self, q):
        "Execute query string. Use no semicolon."
        self.cur.execute(q)
        rows = self.cur.fetchall()
        df = pd.DataFrame(list(rows))
        df.columns = [desc[0] for desc in self.cur.description]

        return df

session = TrinoSession()
session.connect()
df = session.query("Enter your query, to be filled..")
print(df)
session.close()

Log output

Getting these log outputs as well:

Unsupported UTF-8 sequence length when encoding string

'utf-8' codec can't decode byte 0xff in position 337: invalid start byte

Operating System

NA

Trino Python client version

0.322.0

Trino Server version

NA

Python version

3.10

Are you willing to submit PR?

hovaesco commented 1 year ago

Could you please provide DDL for the table with some sample data?

Raghav-254 commented 1 year ago

Hi @hovaesco This is being observed with varbinary datatype values only. Can you please now have a look into this? Regards, Raghav Mittal

hovaesco commented 1 year ago

It looks like a binary value is not properly encoded as UTF-8. Please make sure you're using the X prefix when inserting and the data is properly formatted as hexadecimal string.

Raghav-254 commented 1 year ago

It is working fine with trino-cli as well as trino JDBC driver and with the trino-python-client v0.306.0 as well. It seems to be broken in the latest version of v0.322.0. Therefore probably it should not be the issue with the dataset.

Raghav-254 commented 1 year ago

Hi do you require any extra info from our side?

mdesmet commented 1 year ago

Can you provide a literal varbinary string that produces the issue in the form of SELECT X'<my binary string>'?

Also note that \x<binary> just refers to characters that are not printable in UTF8, it is not necessarily wrong.

Raghav-254 commented 1 year ago

Hi @mdesmet @hovaesco

You can follow the below process to reproduce the above mentioned issue

  1. Create the table with id column having the varbinary datatype. CREATE TABLE IF NOT EXISTS test_data ( testid VARBINARY );

  2. Insert values into the table INSERT INTO test_data (testid) VALUES (CAST('0123456789ABCDEF' AS VARBINARY)); INSERT INTO test_data (testid) VALUES (CAST('x0123456789ABCDEF' AS VARBINARY));

  3. Query the table using select statement. select testid from test_data

Following is the output I am getting while running via trino-python-client v0.322.0

Screenshot 2023-04-17 at 12 37 16 PM

Output I am getting while running via trino-python-client v0.306.0

Screenshot 2023-04-17 at 12 40 49 PM
Raghav-254 commented 1 year ago

Also I found that, it starts breaking from v0.321.0. Upto v.0.320.0 it is working fine. So you can also have a look at the changes which you have made while bumping up the version from v0.320.0 to v.0.321.0

Regards, Raghav Mittal

hashhar commented 1 year ago

In older versions results were always returned as strings. In newer versions they're mapped to Python types.

The values are equivalent and this is not a bug:

>>> x = b'0123456789ABCDEF'
>>> base64.b64encode(x)
b'MDEyMzQ1Njc4OUFCQ0RFRg=='

You can revert to old behaviour by enabling https://github.com/trinodb/trino-python-client#legacy-primitive-types.

You can also read the release notes when upgrading - https://github.com/trinodb/trino-python-client/blob/master/CHANGES.md#breaking-changes

Raghav-254 commented 1 year ago

Hi @mdesmet @hashhar

Thanks for your response.

I was able to render the strings correctly by passing this legacy_primitive_types argument as True. But after doing this, I lost the behavior of displaying row types as dict. The changes which you have made in this PR: https://github.com/trinodb/trino-python-client/pull/335

Basically I patched the above changes of your PR and then tried on v0.322.0 by making legacy_primitive_types as True.

Can you please suggest me the resolution of this issue? Regards, Raghav Mittal

hashhar commented 1 year ago

You can either use results as strings or results as Python types, not both. The string behaviour is only there to allow people to migrate to use Python types. I'd suggest you spend the time to either base64 decode the byte-array you get for VARBINARY columns instead of trying to hack around it.

Raghav-254 commented 1 year ago

ok thanks for your response, lemme try that out.

Raghav-254 commented 1 year ago

Hi @hashhar FYI this thing worked for me, you can refer the try except block in the below code. I made changes in the client.py file under class NamedRowTuple

class NamedRowTuple(tuple):
    """Custom tuple class as namedtuple doesn't support missing or duplicate names"""
    def __new__(cls, values, names: List[str], types: List[str]):
        return super().__new__(cls, values)

    def __init__(self, values, names: List[str], types: List[str]):
        self._names = names
        # With names and types users can retrieve the name and Trino data type of a row
        self.__annotations__ = dict()
        self.__annotations__["names"] = names
        self.__annotations__["types"] = types
        elements: List[Any] = []
        for name, value in zip(names, values):
            if names.count(name) == 1:
                setattr(self, name, value)
                try:
                    if isinstance(value, (bytes, bytearray)):
                        value = base64.b64encode(value).decode('utf-8')
                except Exception as e:
                    pass
                elements.append(f"{name}: {repr(value)}")
            else:
                elements.append(repr(value))
        self._repr = "(" + ", ".join(elements) + ")"
hashhar commented 1 year ago

I didn't mean for you to change the client code itself. I meant you can modify your script/code to become aware of the fact that VARBINARY columns will be Python byte-arrays and handle them accordingly.

There is no change to be done in the client itself.

Raghav-254 commented 1 year ago

ok thanks for your clarification!