rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Add support for `force_ascii=False` when writing to JSON with cuDF engine #15211

Open sarahyurick opened 8 months ago

sarahyurick commented 8 months ago

Describe the bug

Ideally, we should eventually support engine="cudf" and force_ascii=False together with to_json. For now, we should update the documentation and/or provide a warning for users.

Steps/Code to reproduce bug

import cudf

df = cudf.DataFrame({"a": [1,2,3], "b": ["4","5","🌱"]})
df.to_json("test.jsonl", orient="records", lines=True, engine="cudf", force_ascii=False)

produces a TypeError: write_json() got an unexpected keyword argument 'force_ascii'.

Calling df.to_json("test.jsonl", orient="records", lines=True, force_ascii=False) without the cuDF engine writes the emoji as-is to the .jsonl file. Calling df.to_json("test.jsonl", orient="records", lines=True, engine="cudf") without force_ascii writes the emoji as the escape sequence "\ud83c\udf31". What I cannot do is write the emoji as-is while also using the cuDF engine.
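Until the cuDF engine accepts force_ascii=False, one possible stopgap is to post-process the escaped output on the CPU. The sketch below uses only the Python standard library; the helper names unescape_json_lines and rewrite_jsonl are made up for illustration and are not part of cuDF. It re-serializes each record, so whitespace and number formatting may differ from what cuDF wrote, and it gives up the GPU speedup for that pass.

```python
import json


def unescape_json_lines(lines):
    """Yield each JSON Lines record re-serialized without ASCII escaping.

    Hypothetical helper, not a cuDF API. json.loads decodes escape
    sequences such as "\\ud83c\\udf31" back into the original character,
    and ensure_ascii=False writes that character out as-is.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        yield json.dumps(json.loads(line), ensure_ascii=False)


def rewrite_jsonl(src_path, dst_path):
    """Rewrite an escaped .jsonl file as UTF-8 with characters unescaped."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for record in unescape_json_lines(src):
            dst.write(record + "\n")
```

For example, after writing test.jsonl with engine="cudf", calling rewrite_jsonl("test.jsonl", "test_utf8.jsonl") would produce a second file in which the emoji appears as-is.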

Environment details

I tested this with the latest cuDF version.

ayushdg commented 8 months ago

Adding to this, functions like to_csv seem to preserve Unicode text as-is, while to_json converts non-ASCII UTF-8 characters to their ASCII escape sequences before writing. So I'm guessing libcudf does support writing UTF-8 directly (maybe just not within to_json).

simplew2011 commented 5 months ago

Is there any progress? Currently, there is also an exception when saving Chinese text.

vyasr commented 5 months ago

No, this isn't something that we have prioritized yet unfortunately.

karthikeyann commented 1 month ago

This feature is easy to implement: it amounts to skipping the UTF-8 to UTF-16 escape encoding. We need to add the option and skip the escape_strings_fn call at cudf/cpp/src/io/json/write_json.cu:548. It's a good first issue.
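To make the escaping step concrete, here is a minimal Python sketch of what a force_ascii switch changes in a JSON string writer. It is not libcudf's code: the function escape_json_string and its force_ascii parameter are hypothetical, and the sketch assumes that quotes, backslashes, and control characters still need escaping in either mode so that the output remains valid JSON.

```python
import json


def escape_json_string(s: str, force_ascii: bool = True) -> str:
    """Return s as a quoted JSON string (illustrative sketch only)."""
    out = ['"']
    for ch in s:
        cp = ord(ch)
        if ch == '"':
            out.append('\\"')
        elif ch == "\\":
            out.append("\\\\")
        elif cp < 0x20:
            # Control characters must always be escaped in JSON.
            out.append(f"\\u{cp:04x}")
        elif cp < 0x80 or not force_ascii:
            # ASCII is always written as-is; with force_ascii=False
            # non-ASCII characters are written as-is as well.
            out.append(ch)
        elif cp < 0x10000:
            # Basic Multilingual Plane: a single \uXXXX escape.
            out.append(f"\\u{cp:04x}")
        else:
            # Above the BMP: encode as a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append(f"\\u{high:04x}\\u{low:04x}")
    out.append('"')
    return "".join(out)


print(escape_json_string("🌱"))                     # "\ud83c\udf31"
print(escape_json_string("🌱", force_ascii=False))  # "🌱"
```

Both outputs decode to the same string with json.loads, so the choice only affects how the file looks on disk, not the data a JSON reader recovers.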