Open sarahyurick opened 8 months ago
Adding to this, testing functions like `to_csv` seem to preserve the unicode encoding as-is, while `to_json` converts utf-8 chars to the ascii representation before writing. So I'm guessing libcuDF does support directly writing utf-8 (maybe just not within `to_json`).
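To make the two representations concrete, here is a minimal sketch using Python's standard `json` module rather than cuDF. Its `ensure_ascii` flag is only an analogue of pandas' `force_ascii`; the `record` value is made up for illustration.

```python
import json

# Hypothetical one-field record containing a non-ASCII character (an emoji).
record = {"text": "🌱"}

# ensure_ascii=True (the default) escapes non-ASCII characters as \uXXXX
# UTF-16 code units, similar to what the cuDF writer emits today.
escaped = json.dumps(record, ensure_ascii=True)

# ensure_ascii=False keeps the character as-is in the output string.
raw = json.dumps(record, ensure_ascii=False)

print(escaped)
print(raw)

# Both forms are valid JSON and decode to the same value.
assert json.loads(escaped) == json.loads(raw) == record
```

So the escaped output is not lossy; it is just harder to read and larger on disk.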
Is there any progress? Currently, there is also an exception when saving Chinese text.
No, this isn't something that we have prioritized yet unfortunately.
This feature is easy to implement: it only has to skip the UTF-8/UTF-16 escaping step. We need to add the option and skip the `escape_strings_fn` call at `cudf/cpp/src/io/json/write_json.cu:548`. It's a good first issue.
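As a rough illustration of what such an option would change, here is a sketch in plain Python. It is not the libcudf code: `escape_json_string` is a hypothetical helper, and its `force_ascii` parameter is named after the pandas keyword purely for illustration.

```python
def escape_json_string(s: str, force_ascii: bool = True) -> str:
    """Quote a string for JSON, optionally escaping non-ASCII characters."""
    # Characters that must always be escaped, regardless of force_ascii.
    short = {'"': '\\"', "\\": "\\\\", "\n": "\\n", "\r": "\\r",
             "\t": "\\t", "\b": "\\b", "\f": "\\f"}
    out = ['"']
    for ch in s:
        cp = ord(ch)
        if ch in short:
            out.append(short[ch])
        elif cp < 0x20:
            # Remaining control characters always need a \u escape.
            out.append(f"\\u{cp:04x}")
        elif cp < 0x80 or not force_ascii:
            # ASCII, or non-ASCII when escaping is switched off: copy as-is.
            out.append(ch)
        elif cp < 0x10000:
            # Basic Multilingual Plane: a single \uXXXX escape.
            out.append(f"\\u{cp:04x}")
        else:
            # Outside the BMP: encode as a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append(f"\\u{high:04x}\\u{low:04x}")
    out.append('"')
    return "".join(out)


print(escape_json_string("🌱"))                     # escaped surrogate pair
print(escape_json_string("🌱", force_ascii=False))  # emoji written as-is
```

Quotes, backslashes, and control characters still have to be escaped either way, so a real implementation would probably skip only the non-ASCII branch rather than the whole escaping pass.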
Describe the bug
Ideally, we should eventually support `engine="cudf"` and `force_ascii=False` together with `to_json`. For now, we should update the documentation and/or provide a warning for users.

Steps/Code to reproduce bug
Calling `to_json` with both `engine="cudf"` and `force_ascii=False` produces a `TypeError: write_json() got an unexpected keyword argument 'force_ascii'`.

I can do a `df.to_json("test.jsonl", orient="records", lines=True, force_ascii=False)` and see the emoji in the `.jsonl` file, and I can also do a `df.to_json("test.jsonl", orient="records", lines=True, engine="cudf")` and see the emoji represented as "\ud83c\udf31" in the `.jsonl` file. But I am unable to see the emoji written as-is in the file while also writing with the cuDF engine.

Environment details
I tested this with the latest cuDF version.
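Until the two options work together, one possible stopgap is to post-process the escaped file on the CPU. The sketch below uses only the standard library. `unescape_jsonl` is a hypothetical helper, not part of cuDF or pandas, and it assumes the file is JSON Lines with one object per line.

```python
import json


def unescape_jsonl(src_path: str, dst_path: str) -> None:
    """Rewrite a JSON Lines file so non-ASCII characters are stored as-is."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            # Parse the escaped record, then re-serialize without ASCII escaping.
            dst.write(json.dumps(json.loads(line), ensure_ascii=False) + "\n")
```

For example, `unescape_jsonl("test.jsonl", "test_utf8.jsonl")` would turn "\ud83c\udf31" back into the emoji. Re-serializing may change whitespace and number formatting, and it gives up the GPU speedup for this step, so it is only a workaround.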