UTF-16-LE decode error - Githubissues

victoraugustolls commented 4 years ago

Environment

To diagnose, we usually need to know the following, including version numbers. On Windows, be sure to specify 32-bit Python or 64-bit:

Python: 3.7.4
pyodbc: 4.0.27
OS: Docker image = python:3.7-slim-stretch
DB: SQL Server (on Azure)
driver: ODBC Driver 17 for SQL Server

Running flask under gunicorn gevent worker

Issue

Randomly I receive the following error:

File "/usr/local/lib/python3.7/encodings/utf_16_le.py", line 15, in decode
def decode(input, errors='strict'):
File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/base.py", line 196, in handle_abort
     sys.exit(1)
SystemExit: 1
The above exception was the direct cause of the following exception:
...
   File "/code/src/services/mssql.py", line 87, in execute_query
     cur.execute(query)

Tried reversing to 4.0.23 and still no good!

v-makouz commented 4 years ago

Is there a repro app? Does it happen during a particular query?

victoraugustolls commented 4 years ago

No particular query, just any query at random, but I can post my connection class and Dockerfile to help!

victoraugustolls commented 4 years ago

Connection class

import os

import pyodbc
from opencensus.trace import execution_context, status

from helpers import Formatter, Singleton
from helpers.config import MSSQL
from logger import new

LOG = new(__name__)

class MsSQLService(metaclass=Singleton):
    def __init__(self):
        try:
            tracer = execution_context.get_opencensus_tracer()
        except Exception:
            LOG.error("Failed to trace request", exc_info=True)
            raise

        try:
            with tracer.span(name=MSSQL["server_name"]) as _:
                tracer.add_attribute_to_current_span("dependency.type", "SQL")
                self.conn = pyodbc.connect(
                    (
                        "Driver={{ODBC Driver 17 for SQL Server}};Server={};"
                        "Database={};UID={};PWD={};"
                    ).format(
                        MSSQL["server_name"],
                        MSSQL["database"],
                        MSSQL["username"],
                        MSSQL["password"],
                    ),
                    autocommit=True,
                )
        except Exception as e:
            LOG.error(f"Connection exception: {e}")
            raise

    def execute_query(self, query, formatted=True):
        try:
            tracer = execution_context.get_opencensus_tracer()
        except Exception:
            LOG.error("Failed to get tracer", exc_info=True)
            raise

        with tracer.span(name=MSSQL["server_name"]) as _:
            tracer.add_attribute_to_current_span("dependency.type", "SQL")
            tracer.add_attribute_to_current_span("database.query", query)
            try:
                cur = self.conn.cursor()
                cur.execute(query)
                if formatted is True:
                    data = Formatter.format_cursor_mssql(cur)
                else:
                    data = None
            except Exception as e:
                LOG.info(f"MSSQL error: {e}")
                LOG.info(f"MSSQL error cause: {e.__cause__}")
                LOG.info(f"MSSQL error dict: {e.__dict__}")
                span = tracer.current_span()
                span.set_status(status=status.Status.from_exception(e))
                raise
            finally:
                cur.close()

            return data

victoraugustolls commented 4 years ago

Dockerfile

FROM python:3.7-slim-stretch

RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    apt-utils \
    apt-transport-https \
    debconf-utils \
    build-essential \
    gcc \
    gnupg \
    locales && \
    echo 'en_US.UTF-8 UTF-8' > /etc/locale.gen && \
    locale-gen && \
    # adding custom MS and PostgreSQL repository
    curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add - && \
    curl https://packages.microsoft.com/config/debian/9/prod.list > /etc/apt/sources.list.d/mssql-release.list && \
    echo 'deb http://apt.postgresql.org/pub/repos/apt/ stretch-pgdg main' >> /etc/apt/sources.list.d/pgdg.list && \
    curl https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add - && \
    # install SQL Server and PostgreSQL drivers
    apt-get update && ACCEPT_EULA=Y apt-get install -y --no-install-recommends \
    msodbcsql17 \
    mssql-tools \
    unixodbc-dev \
    libpq-dev \
    postgresql-client-10 && \
    echo 'export PATH="$PATH:/opt/mssql-tools/bin"' >> ~/.bash_profile && \
    echo 'export PATH="$PATH:/opt/mssql-tools/bin"' >> ~/.bashrc && \
    /bin/bash -c "source ~/.bashrc" && \
    # clear installations
    apt-get remove -y apt-utils apt-transport-https build-essential curl gnupg && \
    rm -rf /var/lib/apt/lists/*

ENV PYTHONIOENCODING=utf-8

RUN mkdir -p /root/.opencensus/.azure/gunicorn

ADD requirements.txt .
RUN pip install -r requirements.txt

RUN mkdir /code
WORKDIR /code
ADD src /code/
ADD . /code/

COPY init.sh /usr/local/bin/

RUN chmod u+x /usr/local/bin/init.sh

ENTRYPOINT ["init.sh"]

victoraugustolls commented 4 years ago

I can also add that doing:

self.conn.setencoding(encoding="utf-8")
self.conn.setdecoding(pyodbc.SQL_CHAR, encoding="utf-8")
self.conn.setdecoding(pyodbc.SQL_WCHAR, encoding="utf-8")

did not fix the issue

victoraugustolls commented 4 years ago

Doing self.conn.setdecoding(pyodbc.SQL_WMETADATA, encoding="utf-8") seems to fix BUT all the columns name are missing characters now :/

v-chojas commented 4 years ago

Please post an ODBC trace. It is hard to determine what is happening without a trace.

reedjosh commented 4 years ago

https://github.com/mkleehammer/pyodbc/issues/489

I see this exact error randomly instead of the garbled messages in issue 489. I believe this is a result of getting an error message half full of arbitrary characters and attempting to decode it. SqlGetDiagRecW seems to be the issue here too.

v-makouz commented 4 years ago

I'm trying to reproduce this, but I can't pip3 install Formatter because of the missing cStringIO

reedjosh commented 4 years ago

It looks to me like Formatter is a class in his local helper lib. When you run pip install Formatter it fails as formatter was only compat with Python 2.

cStringIO was deprecated with python 3. https://stackoverflow.com/questions/57472080/modulenotfounderror-no-module-named-cstringio

formatter itself the package on pypi would also not make sense in the context for which he was using it. https://pypi.org/project/formatter/ It is a code formatter like Black.

reedjosh commented 4 years ago

Further in reviewing...

Since as Victor mentions: self.conn.setdecoding(pyodbc.SQL_WMETADATA, encoding="utf-8")

Fixes the issue, I doubly think this is related to #489

The reason being if you half the size of a utf character then you will only pull half of the number of characters on err. Half is exactly the issue with SqlGetDiagRecW.

reedjosh commented 4 years ago

Also he mentions it's random. The system error which is thrown occurs randomly as the characters that are pulled (when they shouldn't be) and ran through unicode_decode are random.

So, randomly unicode_decode gets a character it can't decode and throws the system error instead of a pyodbc.error.

If I can find a Docker instance for Oracle, I can setup a reproducible test case for you, but I don't use Docker regularly, so it may be a bit before I can get this done.

reedjosh commented 4 years ago

Finally, can we get some motion on #489? Would a PR to repair #489 be more well received if I was able to provide a reproducible test case that proves the two are related?

victoraugustolls commented 4 years ago

Hi! Sorry for not replying earlier. About the formatter, this is what I'm using:

@staticmethod
def format_cursor_mssql(cursor):
    resp_list = []
    names_list = [
        Formatter.camel_case_name(value[0]) for value in cursor.description
    ]

    for row in cursor:
        resp_list.append(dict(zip(names_list, row)))
    return Formatter.clear_nulls(resp_list)

@staticmethod
def camel_case_name(name):
    return re.sub("_.", lambda x: x.group()[1].upper(), name)

@staticmethod
def clear_nulls(dict_list):
    for obj in dict_list:
        for k, v in obj.items():
            if isinstance(v, str) and v.lower() == "null":
                obj[k] = None

    return dict_list

powelleric commented 2 years ago

I'm getting the same error and I'm also using SQLServer (latest version on RDS) and Python 3.10. It's happening to me in Django. Seems to happen to me when the server is under a bit of strain.

My issue was not resolved by trying to set the decoding to utf-8 in my options, as described here:

DATABASES = {
    'default': {
        'ENGINE': 'mssql',
        'NAME': os.environ['METASTORE_DATABASE'],
        'USER': os.environ['METASTORE_USERNAME'],
        'PASSWORD': os.environ['METASTORE_PASSWORD'],
        'HOST': os.environ['METASTORE_SERVER_URL'],
        'OPTIONS': {
            'driver': 'ODBC Driver 17 for SQL Server',
            "setdecoding": [
                {"sqltype": pyodbc.SQL_CHAR, "encoding": 'utf-8'},
                {"sqltype": pyodbc.SQL_WCHAR, "encoding": 'utf-8'}],
            "setencoding": [
                {"encoding": "utf-8"}],
        },
    },
}

isaacnorman82 commented 2 years ago

I've just seen the same issue, airflow 2.2.3, odbc 3.100, pyodbc 4.0.34, python 3.10.5, on wsl ubuntu. MS SQL Server v12.0.2000.8.

Any update on a solution? Issue has been open a couple of years.

Is self.conn.setdecoding(pyodbc.SQL_WMETADATA, encoding="utf-8") the only metadata setting worth forcing to utf-8 so the underlying issue can get logged properly or are there others?

I'm not sure if it was mentioned above, but when I first saw this in an apache airflow task, it just hung the task for an hour until it was noticed.

Edit:

I ran SELECT CONVERT (varchar(256), SERVERPROPERTY('collation')); which gave SQL_Latin1_General_CP1_CI_AS which according to a quick google means:

latin1: makes the server treat strings using charset latin 1, basically ascii CP1: stands for Code Page 1252 CI: case insensitive comparisons so 'ABC' would equal 'abc' AS: accent sensitive, so 'ü' does not equal 'u'

As such it feels like setting a decoding to utf-8 is not the right thing to do as surely this is just negotiated between the db and the odbc client?

v-chojas commented 2 years ago

There are several different places where encodings can be converted, and thus there is also the possibility that the encoding is different in different parts of the system. The pyODBC encoding options are used to set the encoding used between pyODBC and the ODBC driver. You are using MSSQL but I am not sure what "odbc 3.100" is; in any case, the ODBC Driver for SQL Server usually defaults to UTF-8 for its narrow string encoding, although later versions will detect the environment encoding (which is usually UTF-8 on Linux) and use that.

If you would like a better investigation of this issue then post more details about the data which is causing it, along with an ODBC trace and short self-contained repro code.

Also, 4.0.34 version of pyODBC has a major problem with including its own old unixODBC which contains much bugs, use '32 for the time being.

mkleehammer commented 1 year ago

Also, 4.0.34 version of pyODBC has a major problem with including its own old unixODBC which contains much bugs, use '32 for the time being.

I don't remember this. Has this been fixed in 4.0.35?

keitherskine commented 1 year ago

The issue of unixODBC being included in the release wheels was indeed fixed in 4.0.35. Note the customized use of CIBW_REPAIR_WHEEL_COMMAND options in the build script: https://github.com/mkleehammer/pyodbc/blob/master/.github/workflows/artifacts_build.yml#L56-L72

mkleehammer / pyodbc

UTF-16-LE decode error #640

Environment

Issue