Closed jfurtek closed 1 year ago
@vuule This looks like a JSON reader/libcudf issue, because I'm not able to repro it by creating a string column directly in cudf:
>>> import cudf
>>> cdf = cudf.read_json('sample.json', lines=True)
>>> import pandas as pd
>>> df = pd.read_json('sample.json', lines=True)
>>> df
value data
0 16 thirty_chars_plus_two_newlines\n\n
1 19 more_data_with_newline\r\n
>>> cdf
value data
0 16 thirty_chars_plus_two_newlines\n\n
1 19 more_data_with_newline\r\n
>>> cdf.data.str.len()
0 34
1 25
Name: data, dtype: int32
>>> s = cudf.Series(["thirty_chars_plus_two_newlines\n\n", "more_data_with_newline\r\n"])
>>> s
0 thirty_chars_plus_two_newlines\n\n
1 more_data_with_newline\r\n
dtype: object
>>> s[0]
'thirty_chars_plus_two_newlines\n\n'
>>> s.str.len()
0 32
1 24
dtype: int32
cudf is copying the strings from the file as-is. This is the same behavior as the CSV reader, and the Pandas CSV reader (by default).
However, the Pandas CSV reader replaces escaped newlines in strings when the escapechar parameter is passed. It looks like the Pandas JSON reader implicitly uses escapechar='\', which is why the behavior differs from cudf.
Because of the expected performance overhead, the cudf CSV reader does not support the escapechar parameter. We need to evaluate whether we want to pursue parity with Pandas here.
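The length difference in the session above (34 vs. 32) follows from whether the two-character escape sequences are decoded. Here is a minimal sketch of that difference using only Python's standard json module rather than cudf or pandas; the variable names are illustrative, and the slicing only mimics an as-is copy, it is not how either reader is implemented:

```python
import json

# One record of the JSON Lines file as stored on disk: the raw string
# keeps each backslash and "n" as two separate characters.
raw_line = r'{"value": 16, "data": "thirty_chars_plus_two_newlines\n\n"}'

# Copy the characters between the quotes as-is, with no escape handling.
start = raw_line.index('"data": "') + len('"data": "')
raw_copy = raw_line[start:raw_line.rindex('"')]

# Decode the JSON escape sequences into real newline characters.
decoded = json.loads(raw_line)["data"]

print(len(raw_copy))  # 34: 30 characters plus two 2-character escapes
print(len(decoded))   # 32: 30 characters plus two newline characters
```

The as-is copy matches the cdf.data.str.len() result, and the decoded value matches the length of the Series built from Python string literals.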
It seems that this issue is addressed by the new JSON reader:
import cudf
# Doubled backslashes keep \n and \r\n as JSON escapes; the single \n separates the two records.
j = '{"value": 16, "data": "thirty_chars_plus_two_newlines\\n\\n" }\n{"value": 19, "data": "more_data_with_newline\\r\\n"}'
df = cudf.read_json(j, lines=True, engine='cudf_experimental')
value data
0 16 thirty_chars_plus_two_newlines\n\n
1 19 more_data_with_newline\r\n
I'll close this in favor of #11982
Describe the bug
Retrieving strings that contain a newline character from a Series returns a string with an "escaped" backslash (\\n) instead of \n. The cudf output does not match the equivalent pandas output.
Steps/Code to reproduce bug
1.) Create a JSON file with the following contents, called test.json:
2.) Create dataframes using both Pandas and cuDF, and compare the output of extracting a single element from a series:
Expected behavior
The output should match pandas output.
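The file contents for step 1 are not shown here. Assuming they match the two records in the sample from the comments (an assumption, not confirmed by the report), the following sketch shows one way to write such a file from Python without accidentally putting real line breaks inside the strings. The temporary path and variable names are hypothetical:

```python
import json
import os
import tempfile

# Assumed records, taken from the sample shown in the comments.
records = [
    {"value": 16, "data": "thirty_chars_plus_two_newlines\n\n"},
    {"value": 19, "data": "more_data_with_newline\r\n"},
]

# Hypothetical location for the test file.
path = os.path.join(tempfile.mkdtemp(), "test.json")

with open(path, "w", newline="") as f:
    for rec in records:
        # json.dumps emits each newline as the two characters "\" and "n",
        # so every record stays on one physical line.
        f.write(json.dumps(rec) + "\n")

with open(path, newline="") as f:
    lines = f.read().split("\n")[:-1]  # drop the empty tail after the last "\n"

# Decoding each line gives the values a reader is expected to return.
expected = [json.loads(line)["data"] for line in lines]
print(expected)
```

Using json.dumps avoids the trap of typing \n inside a regular Python string literal, which would insert a real line break and split the record.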