Closed (mmantyla closed 1 month ago)
I cannot replicate this. As far as I can tell, the example CSV is valid UTF-8. When I try with invalid UTF-8, the result is as expected. Note that open() defaults to the system encoding (UTF-8 in my case), so I explicitly set the encoding to latin-1.
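import csv

import polars as pl

filename_non_utf8 = "non_utf8_characters.csv"

# latin-1 encodes these characters as the single bytes 0xF0 and 0xFC,
# which do not form valid UTF-8 sequences here.
rows_with_non_utf8 = [
    [b"Hello \xF0World".decode("latin-1")],
    ["AB\xfc"],
]

# Write as latin-1 so the file on disk is not valid UTF-8.
with open(filename_non_utf8, mode="w", newline="", encoding="latin-1") as file:
    writer = csv.writer(file)
    writer.writerows(rows_with_non_utf8)

# Read the file back, replacing invalid UTF-8 sequences with U+FFFD.
df = pl.read_csv(
    filename_non_utf8,
    has_header=False,
    schema=None,
    infer_schema=False,
    quote_char=None,
    encoding="utf8-lossy",
)
print(df)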
shape: (2, 1)
┌──────────────┐
│ column_1     │
│ ---          │
│ str          │
╞══════════════╡
│ Hello �World │
│ AB�          │
└──────────────┘
Using encoding="utf8" in read_csv results in "Original error: invalid utf-8 sequence".
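The difference between the two encoding settings mirrors strict versus replacing decoding in plain Python. The sketch below is only an analogy using the standard library, not a description of how Polars implements it. It decodes the same two byte strings that the example above writes to disk.

```python
# The two raw byte strings that end up in the latin-1 file from the example.
samples = [b"Hello \xf0World", b"AB\xfc"]

# Strict decoding: analogous in spirit to encoding="utf8", which rejects bad input.
strict_errors = []
for raw in samples:
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        strict_errors.append(exc.reason)

# Replacing decoding: analogous in spirit to encoding="utf8-lossy".
# Each invalid byte becomes U+FFFD, the replacement character.
lossy = [raw.decode("utf-8", errors="replace") for raw in samples]

print(strict_errors)
print(lossy)
```

Both samples fail strict decoding, while the replacing decoder yields the same strings shown in the DataFrame output above.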
As far as I can tell the example CSV is valid UTF-8.
Indeed, you are absolutely correct. It looks like the culprit is our downstream component, which has a poor understanding of UTF-8.
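One way to confirm whether a file really is valid UTF-8, before suspecting the reader or a downstream component, is to decode it line by line in strict mode. This is a minimal sketch using only the standard library. The function name `find_invalid_utf8_lines` and the temporary demo file are illustrative assumptions, not part of Polars or of the project.

```python
import os
import tempfile


def find_invalid_utf8_lines(path):
    """Return (line_number, raw_bytes) for each line that is not valid UTF-8."""
    bad = []
    with open(path, "rb") as f:  # binary mode: no implicit decoding
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")  # strict decoding raises on invalid bytes
            except UnicodeDecodeError:
                bad.append((lineno, raw))
    return bad


# Small self-contained demo with a hypothetical temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("valid line ü\n".encode("utf-8"))  # valid two-byte sequence
    tmp.write(b"AB\xfc\n")  # latin-1 byte, invalid as UTF-8
    demo_path = tmp.name

invalid_lines = find_invalid_utf8_lines(demo_path)
os.remove(demo_path)
print(invalid_lines)
```

An empty result means every line decodes cleanly, which points the investigation at whatever consumes the data afterwards.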
Checks
Reproducible example
Output should have �-characters instead of non-UTF-8 characters.
Log output
Issue description
We use Polars for log analytics due to its efficient handling of large files with millions of lines. You can find our project here: https://github.com/EvoTestOps/LogLead
Recently, we encountered logs containing non-UTF8 characters. To address this, we use polars.read_csv or polars.scan_csv with the encoding="utf8-lossy" option. According to the documentation, "lossy" encoding should replace invalid UTF-8 sequences with the � (replacement) character. This would allow us to easily filter out rows with encoding issues, as non-UTF8 characters cause downstream problems in our pipeline.
However, the utf8-lossy encoding doesn't seem to work as expected.
It would be great to see this either fixed or reflected more clearly in the documentation. Thank you for your time and attention to this issue!
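The filtering step described above, separating rows that picked up a replacement character during lossy decoding, can be sketched in plain Python. The helper name and the sample log lines are hypothetical. In Polars the equivalent would presumably be a literal string-contains filter on the column for "\ufffd", which is not shown here.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, produced by lossy decoding


def split_by_replacement_char(lines):
    """Separate lines into (clean, damaged) by presence of U+FFFD."""
    clean, damaged = [], []
    for line in lines:
        (damaged if REPLACEMENT_CHAR in line else clean).append(line)
    return clean, damaged


# Hypothetical log lines, decoded lossily from raw bytes.
raw_lines = [b"INFO start", b"Hello \xf0World", b"AB\xfc", b"INFO done"]
decoded = [raw.decode("utf-8", errors="replace") for raw in raw_lines]

clean, damaged = split_by_replacement_char(decoded)
print(clean)
print(damaged)
```

Note that a line which legitimately contains U+FFFD in valid UTF-8 would also be flagged by this check.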
Expected behavior
Output should have �-characters instead of non-UTF-8 characters.
Installed versions