revans2 opened this issue 2 years ago
I was able to make the test case a lot simpler and still see the same error.
1
2
abc""
4
5
shows the same problem: only the first three lines come out, and the third entry is only abc", missing the final ".
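For reference, Python's standard csv module (which pandas broadly agrees with here) treats quote characters inside a field that does not start with a quote as literal data; a minimal sketch:

```python
import csv
import io

# The third field does not start with a quote character, so the
# embedded quotes should be kept as literal data (pandas agrees).
data = '1\n2\nabc""\n4\n5\n'
rows = [row[0] for row in csv.reader(io.StringIO(data))]
print(rows)
```

All five rows come back, and the third keeps both trailing quotes, which is the behavior this issue asks cuDF to match.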
Wow, even if I escape the quotes I still get the line-dropping problem. Escaping "fixes" the issue with the quote at the end being removed, but the escape characters are not stripped from the quotes.
1
2
"abc\"\""
4
5
But the output is
+-------+
| _c0|
+-------+
| 1|
| 2|
|abc\"\"|
+-------+
when it should be
+-----+
| _c0|
+-----+
| 1|
| 2|
|abc""|
| 4|
| 5|
+-----+
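As a sanity check, pandas can be told to honor backslash escapes via its escapechar option; a minimal sketch (escapechar is not the pandas default, so treating the file as backslash-escaped is an assumption here):

```python
import io

import pandas as pd

# Backslashes are doubled so the data really contains \" sequences.
data = '1\n2\n"abc\\"\\""\n4\n5\n'
# escapechar='\\' tells pandas to treat \" as an escaped quote and to
# strip the escape characters from the parsed value.
df = pd.read_csv(io.StringIO(data), header=None, escapechar='\\')
print(df[0].tolist())
```

With that option, pandas produces the abc"" value shown in the expected table above.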
Oddly, if I remove the escapes and just keep the entire field quoted:
1
2
"abc"""
4
5
It fixes the problem with dropped lines, but the quoted entry itself still comes back wrong.
+----+
| _c0|
+----+
| 1|
| 2|
|abc"|
| 4|
| 5|
+----+
vs from spark
+-----+
| _c0|
+-----+
| 1|
| 2|
|abc""|
| 4|
| 5|
+-----+
For this case I am less sure we have to match Spark exactly, because pandas matches cuDF here. Pandas also handles escaped quotes differently, so just take these as informational for now.
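The pandas/cuDF result for this input matches Python's csv module, which follows RFC 4180 quote doubling: inside a quoted field, "" collapses to a single ". A quick check:

```python
import csv
import io

# In a quoted field, a doubled quote ("") is an escaped single quote,
# so "abc""" parses as abc plus one literal quote.
data = '1\n2\n"abc"""\n4\n5\n'
rows = [row[0] for row in csv.reader(io.StringIO(data))]
print(rows)
```

The third entry comes back as abc" (one quote), agreeing with cuDF and pandas rather than with Spark's abc"".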
Thanks @revans2 for investigating this. I'm posting a python reproducer:
>>> s = '1\n2\nabc""\n4\n5'
>>> pd.read_csv(StringIO(s), header=None)
0
0 1
1 2
2 abc""
3 4
4 5
>>> cudf.read_csv(StringIO(s), header=None)
0
0 1
1 2
2 abc"
I can't reproduce the issue from the comment above. Trying it in Python:
s = '1\n2\n"abc\"\""\n4\n'
But I'm getting the same output as with Pandas (and it looks correct):
1
0 2
1 abc"
2 4
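One likely reason the repro above disagrees: in a regular (non-raw) Python string literal, \" is just ", so the s defined above actually contains "abc""" rather than a backslash-escaped field. To get literal backslashes into the data, they must themselves be escaped:

```python
# In a normal Python literal, \" collapses to ", so these two strings
# hold different CSV data.
unescaped = '1\n2\n"abc\"\""\n4\n5\n'   # third line is "abc"""
escaped = '1\n2\n"abc\\"\\""\n4\n5\n'   # third line is "abc\"\""
print(unescaped.splitlines()[2])
print(escaped.splitlines()[2])
```

Only the second string reproduces the backslash-escaped variant discussed in the earlier comment.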
Did some scoping, and it seems this requires large changes to the way the reader finds row offsets. The current state machine has four states (represented by two bits), and handling this would require an additional state, and thus more bits. My main concern is the work involved in changing how the state machine packs and handles the states.
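For illustration only (this is a toy sketch, not cuDF's actual state machine), a scanner that finds row offsets while tracking quote state shows where the ambiguity lives: after a quote inside a quoted field, the scanner cannot know whether the field closed until it sees the next character, which is exactly the kind of extra state the comment above describes:

```python
# Toy row-offset scanner with three states; cuDF's real machine packs
# more states into bits and runs in parallel, which is the hard part.
OUT, IN_QUOTES, QUOTE_IN_QUOTES = range(3)

def row_offsets(data: str, quote: str = '"') -> list[int]:
    offsets = [0]
    state = OUT
    for i, c in enumerate(data):
        if state == OUT:
            if c == quote:
                state = IN_QUOTES
            elif c == '\n':
                offsets.append(i + 1)
        elif state == IN_QUOTES:
            if c == quote:
                # Could be the closing quote or half of a doubled "".
                state = QUOTE_IN_QUOTES
        else:  # QUOTE_IN_QUOTES
            if c == quote:
                state = IN_QUOTES        # doubled quote: still in the field
            elif c == '\n':
                state = OUT
                offsets.append(i + 1)    # field closed, newline ends the row
            else:
                state = OUT

    return offsets

print(row_offsets('1\n2\n"abc"""\n4\n5\n'))
```

The newline inside a quoted field is correctly skipped only because the scanner resolves the quote/doubled-quote ambiguity one character later.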
Describe the bug
This is directly from https://github.com/NVIDIA/spark-rapids/issues/6435. If you have a field like
abc""
in a CSV file, the cudf CSV parser stops processing more data.
Steps/Code to reproduce bug
Create a file
test.csv
with the following data in it. Now try to read it using cuDF. The last two rows are skipped, and the
abc""
is read back missing the last "
(From Spark using the RAPIDS plugin for Apache Spark.)
Without the plugin I get back the full five rows, which is also what I get back from pandas.
Expected behavior
cuDF returns the same result as pandas and Spark.