pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

utf8-lossy Encoding Not Working with pl.read_csv #19064

Closed: mmantyla closed this issue 1 month ago

mmantyla commented 1 month ago

Reproducible example

import polars as pl
import os
import csv 
# Ensure this always gets executed in the same location
script_dir = os.path.dirname(os.path.abspath(__file__))
os.chdir(script_dir)

# Create non-utf8 data. 
data = [
    ["Bad utf8 spaces"], # 0xA0 (\xa0): Non-breaking space in ISO-8859-1.
    ["Bad left quation “ mark"], #0x93 (\x93): Left double quotation mark.
    [b'Bad \x85Non-UTF-8 string'.decode('windows-1252', errors='replace')], #\x85: Three dots ...
    ["Bad Chinese: 漢"],
    ["Good row"]
]
filename = "mini.csv"
# Write the data to a CSV file without a header
with open(filename, mode="w", newline="") as file:
    writer = csv.writer(file)
    writer.writerows(data)

df = pl.read_csv(filename, has_header=False, schema=None, infer_schema=False, quote_char=None, encoding="utf8-lossy")
# Expected: dataframe with non-UTF-8 chars replaced
print(df)

The output should contain � characters in place of the non-UTF-8 characters, but the actual output is:

shape: (5, 1)
┌─────────────────────────┐
│ column_1                │
│ ---                     │
│ str                     │
╞═════════════════════════╡
│ Bad utf8 spaces         │
│ Bad left quation “ mark │
│ Bad …Non-UTF-8 string   │
│ Bad Chinese: 漢         │
│ Good row                │
└─────────────────────────┘

Log output

file < 128 rows, no statistics determined
no. of chunks: 1 processed by: 1 threads.

Issue description

We use Polars for log analytics due to its efficient handling of large files with millions of lines. You can find our project here: https://github.com/EvoTestOps/LogLead

Recently, we encountered logs containing non-UTF-8 characters. To address this, we use polars.read_csv or polars.scan_csv with the encoding="utf8-lossy" option. According to the documentation, "lossy" encoding should replace invalid UTF-8 sequences with the � (replacement) character. This would allow us to easily filter out rows with encoding issues, as non-UTF-8 characters cause downstream problems in our pipeline.
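The kind of post-read filtering we have in mind looks roughly like this (a minimal sketch added for illustration; column_1 is just the default name Polars assigns when has_header=False, and the filter itself is not part of the original pipeline):

import polars as pl

# Sketch: after a lossy read, rows with encoding problems could be dropped
# by filtering on the U+FFFD replacement character.
df = pl.read_csv("mini.csv", has_header=False, infer_schema=False, encoding="utf8-lossy")
clean = df.filter(~pl.col("column_1").str.contains("\ufffd", literal=True))
print(clean)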

However, the utf8-lossy encoding doesn't seem to work as expected.

It would be great to see this either fixed or reflected more clearly in the documentation. Thank you for your time and attention to this issue!

Expected behavior

The output should contain � characters in place of the non-UTF-8 characters.

Installed versions

```
--------Version info---------
Polars:        1.9.0
Index type:    UInt32
Platform:      Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:        3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel
fsspec
gevent
great_tables
matplotlib
nest_asyncio
numpy
openpyxl
pandas
pyarrow
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
None
```
edwinvehmaanpera commented 1 month ago

I cannot replicate this. As far as I can tell, the example CSV is valid UTF-8. When I try with actually invalid UTF-8, the result is as expected. Note that open() defaults to the system encoding (UTF-8 in my case), so I explicitly specify the encoding as latin-1.
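To see why, here is a quick check on the original mini.csv (a sketch, assuming the file written by the reproducer above): the bytes on disk decode cleanly as strict UTF-8, because open() without an encoding argument re-encodes the already-decoded strings.

# Sketch: verify that the reproducer's file is already valid UTF-8,
# so there is nothing for utf8-lossy to replace.
with open("mini.csv", "rb") as f:
    raw = f.read()
raw.decode("utf-8")  # succeeds, no UnicodeDecodeError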

import csv
import polars as pl

filename_non_utf8 = "non_utf8_characters.csv"

rows_with_non_utf8 = [
    [b"Hello \xF0World".decode("latin-1")],
    ["AB\xfc"],
]

with open(filename_non_utf8, mode="w", newline="", encoding="latin-1") as file:
    writer = csv.writer(file)
    writer.writerows(rows_with_non_utf8)

df = pl.read_csv(
    filename_non_utf8,
    has_header=False,
    schema=None,
    infer_schema=False,
    quote_char=None,
    encoding="utf8-lossy",
)
print(df)

shape: (2, 1)
┌──────────────┐
│ column_1     │
│ ---          │
│ str          │
╞══════════════╡
│ Hello �World │
│ AB�          │
└──────────────┘

Using encoding="utf8" in read_csv instead results in: Original error: invalid utf-8 sequence.
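For comparison, this is roughly how the strict path fails on the same latin-1 file (a sketch; catching ComputeError is my assumption about the exception type, and the message is the one quoted above):

import polars as pl

try:
    pl.read_csv(filename_non_utf8, has_header=False, encoding="utf8")
except pl.exceptions.ComputeError as err:
    print(err)  # ... invalid utf-8 sequence ...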

mmantyla commented 1 month ago

As far as I can tell, the example CSV is valid UTF-8.

Indeed, you are absolutely correct. It looks like the culprit is our downstream component, which has a poor understanding of UTF-8.