pallets / werkzeug

The comprehensive WSGI web application library.
https://werkzeug.palletsprojects.com
BSD 3-Clause "New" or "Revised" License

Download to CSV shows a special Chinese characters #2922

Closed Habeeb556 closed 2 months ago

Habeeb556 commented 2 months ago

This bug is related to Apache Superset. When I download a CSV of query results that contain non-English characters, the file shows unexpected Chinese characters, as in the example attached to https://github.com/apache/superset/pull/29506. This happens even with the export configured as follows:

csv_data = df_to_escaped_csv(df, index=False, encoding='utf-8', **config["CSV_EXPORT"])

This behavior did not occur with Werkzeug 2.3.8; we started seeing it with version 3.x.

Environment:

ThiefMaster commented 2 months ago

Please provide a minimal working example that reproduces this, not just a line of code without any context.

Habeeb556 commented 2 months ago

@ThiefMaster here is the application log:

Fetching CSV from results backend [29c6731c-f1d2-4636-9e1a-80889fac9d55]
2024-07-09 12:25:32,657:INFO:superset.commands.sql_lab.export:Fetching CSV from results backend [29c6731c-f1d2-4636-9e1a-80889fac9d55]
Decompressing
2024-07-09 12:25:32,659:INFO:superset.commands.sql_lab.export:Decompressing
Using pandas to convert to CSV
2024-07-09 12:25:32,916:INFO:superset.commands.sql_lab.export:Using pandas to convert to CSV

And here is the export.py file that converts the data to CSV:

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
from __future__ import annotations

import logging
from typing import Any, cast, TypedDict

import pandas as pd
from flask_babel import gettext as __

from superset import app, db, results_backend, results_backend_use_msgpack
from superset.commands.base import BaseCommand
from superset.errors import ErrorLevel, SupersetError, SupersetErrorType
from superset.exceptions import SupersetErrorException, SupersetSecurityException
from superset.models.sql_lab import Query
from superset.sql_parse import ParsedQuery
from superset.sqllab.limiting_factor import LimitingFactor
from superset.utils import core as utils, csv
from superset.views.utils import _deserialize_results_payload

config = app.config

logger = logging.getLogger(__name__)

class SqlExportResult(TypedDict):
    query: Query
    count: int
    data: list[Any]

class SqlResultExportCommand(BaseCommand):
    _client_id: str
    _query: Query

    def __init__(
        self,
        client_id: str,
    ) -> None:
        self._client_id = client_id

    def validate(self) -> None:
        self._query = (
            db.session.query(Query).filter_by(client_id=self._client_id).one_or_none()
        )
        if self._query is None:
            raise SupersetErrorException(
                SupersetError(
                    message=__(
                        "The query associated with these results could not be found. "
                        "You need to re-run the original query."
                    ),
                    error_type=SupersetErrorType.RESULTS_BACKEND_ERROR,
                    level=ErrorLevel.ERROR,
                ),
                status=404,
            )

        try:
            self._query.raise_for_access()
        except SupersetSecurityException as ex:
            raise SupersetErrorException(
                SupersetError(
                    message=__("Cannot access the query"),
                    error_type=SupersetErrorType.QUERY_SECURITY_ACCESS_ERROR,
                    level=ErrorLevel.ERROR,
                ),
                status=403,
            ) from ex

    def run(
        self,
    ) -> SqlExportResult:
        self.validate()
        blob = None
        if results_backend and self._query.results_key:
            logger.info(
                "Fetching CSV from results backend [%s]", self._query.results_key
            )
            blob = results_backend.get(self._query.results_key)
        if blob:
            logger.info("Decompressing")
            payload = utils.zlib_decompress(
                blob, decode=not results_backend_use_msgpack
            )
            obj = _deserialize_results_payload(
                payload, self._query, cast(bool, results_backend_use_msgpack)
            )

            df = pd.DataFrame(
                data=obj["data"],
                dtype=object,
                columns=[c["name"] for c in obj["columns"]],
            )

            logger.info("Using pandas to convert to CSV")
        else:
            logger.info("Running a query to turn into CSV")
            if self._query.select_sql:
                sql = self._query.select_sql
                limit = None
            else:
                sql = self._query.executed_sql
                limit = ParsedQuery(
                    sql,
                    engine=self._query.database.db_engine_spec.engine,
                ).limit
            if limit is not None and self._query.limiting_factor in {
                LimitingFactor.QUERY,
                LimitingFactor.DROPDOWN,
                LimitingFactor.QUERY_AND_DROPDOWN,
            }:
                # remove extra row from `increased_limit`
                limit -= 1
            df = self._query.database.get_df(sql, self._query.schema)[:limit]

        csv_data = csv.df_to_escaped_csv(df, index=False, **config["CSV_EXPORT"])

        return {
            "query": self._query,
            "count": len(df.index),
            "data": csv_data,
        }

Habeeb556 commented 2 months ago

With version 3.x, the exported CSV data looks like this:

image

However, with version 2.x, it looks like this:

image

davidism commented 2 months ago

We need a minimal reproducible example demonstrating that the issue is in Werkzeug. We can't debug Apache Superset or this complicated example. This also looks like an encoding issue, either when the file is written or when it is opened in Excel, and neither of those is handled by Werkzeug.
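
To illustrate the kind of encoding mismatch meant here, the following is a minimal sketch using only the Python standard library. It assumes, purely as a hypothesis, that a reader interprets UTF-8 bytes as UTF-16; nothing in this thread establishes that this is what actually happens. The sample text and the function name `misread_utf8_as_utf16` are made up for the demonstration.

```python
import codecs

SAMPLE = "name,age"  # hypothetical CSV header, 8 ASCII characters


def misread_utf8_as_utf16(text: str) -> str:
    """Encode text as UTF-8, then decode the same bytes as UTF-16-LE."""
    raw = text.encode("utf-8")
    if len(raw) % 2:
        raw += b" "  # pad to an even length so every byte pair decodes
    return raw.decode("utf-16-le", errors="replace")


garbled = misread_utf8_as_utf16(SAMPLE)
# For this sample, each pair of ASCII bytes becomes one CJK code point.
print(garbled)

# Encoding with "utf-8-sig" prepends a byte order mark (BOM), which
# some spreadsheet programs use to recognize the file as UTF-8.
with_bom = SAMPLE.encode("utf-8-sig")
print(with_bom.startswith(codecs.BOM_UTF8))
```

If a mismatch like this is the cause, the bytes on disk may be identical between versions while only the reader's interpretation differs, which is why inspecting the raw bytes matters more than what the spreadsheet displays.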

mistercrunch commented 2 months ago

"We need a minimal reproducible example demonstrating that the issue is in Werkzeug"

+1. The issue as filed against Werkzeug refers to Superset-specific code, so the maintainers won't be able to reproduce or fix it. We may get lucky, though, and they might point to a 3.x breaking change that they know can cause this kind of issue. In any case, it would be good to pin down exactly where Werkzeug itself behaves differently from what you expect.

Habeeb556 commented 2 months ago

This is strange. To reproduce the issue, I did the following:

  1. Installed Werkzeug version 3.0.3 with pip install Werkzeug==3.0.3.
  2. Restarted Superset without making any other changes, and got the Chinese characters.
  3. Installed Werkzeug version 2.3.8 with pip install Werkzeug==2.3.8.
  4. Restarted Superset again, and the output was correct.
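
One way to narrow down steps like these is to compare the raw bytes of the two downloads rather than how a spreadsheet renders them. The sketch below is a hypothetical helper, not part of Superset or Werkzeug; `describe_bytes` and `describe_file` are illustrative names, and any file paths passed to them are placeholders.

```python
import codecs
import hashlib
from pathlib import Path

# Byte order marks to look for. UTF-32 is checked before UTF-16
# because the UTF-32-LE mark begins with the UTF-16-LE mark.
BOMS = [
    ("utf-8-sig", codecs.BOM_UTF8),
    ("utf-32-le", codecs.BOM_UTF32_LE),
    ("utf-32-be", codecs.BOM_UTF32_BE),
    ("utf-16-le", codecs.BOM_UTF16_LE),
    ("utf-16-be", codecs.BOM_UTF16_BE),
]


def describe_bytes(raw: bytes) -> dict:
    """Summarize a byte string: size, leading BOM, hash, first bytes."""
    bom = next((name for name, mark in BOMS if raw.startswith(mark)), None)
    return {
        "size": len(raw),
        "bom": bom,
        "sha256": hashlib.sha256(raw).hexdigest(),
        "head": raw[:16].hex(" "),
    }


def describe_file(path: str) -> dict:
    """Summarize the raw contents of a downloaded file."""
    return describe_bytes(Path(path).read_bytes())


# In-memory demonstration with made-up data.
plain = "name,city\n".encode("utf-8")
print(describe_bytes(plain))
print(describe_bytes(codecs.BOM_UTF8 + plain))
```

Running `describe_file` on one export saved under each Werkzeug version would show whether the file contents differ at all. Identical hashes would suggest the difference lies in how the file is opened, not in how it is written.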

mistercrunch commented 2 months ago

Right, though a "minimal reproducible example" would identify exactly which Werkzeug method or feature we use has changed behavior between those versions, and would strip out all the Superset-specific logic. Without knowing what caused the change in behavior, it's hard to even assert that the 2.3.8 behavior is the correct one. It may happen to be the desired behavior for your particular use case, but that doesn't mean it's right.
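
As a rough sketch of what such an example might look like, the script below serves a small CSV through a Werkzeug `Response` and reads it back with Werkzeug's test client, with no Superset code involved. It assumes Werkzeug 2.0 or later is installed. The sample data, the file name `export.csv`, and the header choices are illustrative assumptions and do not reflect how Superset builds its response.

```python
from importlib.metadata import version

from werkzeug.test import Client
from werkzeug.wrappers import Response

# Hypothetical sample data standing in for real query results.
CSV_TEXT = "name,city\nZoë,Zürich\n"
CSV_BYTES = CSV_TEXT.encode("utf-8")


def app(environ, start_response):
    """Tiny WSGI app that returns the CSV as a file download."""
    response = Response(CSV_BYTES, mimetype="text/csv")
    response.headers["Content-Disposition"] = "attachment; filename=export.csv"
    return response(environ, start_response)


client = Client(app)
result = client.get("/")

# Record the version so runs under 2.3.8 and 3.0.3 can be compared.
print("Werkzeug:", version("werkzeug"))
print("Content-Type:", result.headers.get("Content-Type"))
print("Body unchanged:", result.data == CSV_BYTES)
```

Running the same script under both versions and comparing the printed headers and body would show whether Werkzeug itself produces different output. If it does not, the difference is more likely in the calling code or in how the file is opened.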