rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] CSV reader cannot handle unquoted quote character appearing in a field #11948

Open revans2 opened 2 years ago

revans2 commented 2 years ago

Describe the bug
This is taken directly from https://github.com/NVIDIA/spark-rapids/issues/6435. If a CSV file contains a field like abc"", the cudf CSV parser stops processing any more data.

Steps/Code to reproduce bug
Create a file test.csv with the following data in it.

134324937434,#1991 N Grayhawk,"",Menlo Park,89025,AB,United States,US
208564744937,"63,trevion Way","",st Lothian,h7f4h8,"",United Kingdom,GB
132709376823,16 Oakland PARK RD,"",ring,l1w1e4,South,Canada,CA
224867848652,7 kingwell Court,"",United,s7jd9,South United,United Kingdom,GB
169636884295,30 cartuja Road,"",Halifax,L0R 9p2,ON,Canada,CA
859473321609,Street,"",Manchester,92220,OR,United States,US
141096112545,99 rue des,"",Australia,jsd9je,"",France,FR
160397658930,5 Rise,"",walligshngton,RY6 8LT,FORT,United Kingdom,GB
726367494002,1852 Townsend st,666,Wallsend,90382,CA,United States,US
187644735867,Bärbel-HAMPDEN-Ping 37,"",Miami,13355,"",Kingdom,ZZ
948475348324,155 sw City ct,Rochdale,Germany,30864,FL,Australia,QQ
164083193213,abc"","",Jerez Fra.,11401,Cadiz,Spain,ES
198732413077,3p Grove Rochdale road,BAW,Fulifax,HX4 trW,"",Israel,GB
227433927227,95 novem blvd,"",RAW VILLAGE,3173,XYZ,Australia,IL

Now try to read it using CUDF. The last two rows are skipped, and abc"" is read back missing its last ".

(From spark using the rapids plugin for apache spark)

+------------+--------------------+--------+-------------+-------+------------+--------------+---+
|         _c0|                 _c1|     _c2|          _c3|    _c4|         _c5|           _c6|_c7|
+------------+--------------------+--------+-------------+-------+------------+--------------+---+
|134324937434|    #1991 N Grayhawk|    null|   Menlo Park|  89025|          AB| United States| US|
|208564744937|      63,trevion Way|    null|   st Lothian| h7f4h8|        null|United Kingdom| GB|
|132709376823|  16 Oakland PARK RD|    null|         ring| l1w1e4|       South|        Canada| CA|
|224867848652|    7 kingwell Court|    null|       United|  s7jd9|South United|United Kingdom| GB|
|169636884295|     30 cartuja Road|    null|      Halifax|L0R 9p2|          ON|        Canada| CA|
|859473321609|              Street|    null|   Manchester|  92220|          OR| United States| US|
|141096112545|          99 rue des|    null|    Australia| jsd9je|        null|        France| FR|
|160397658930|              5 Rise|    null|walligshngton|RY6 8LT|        FORT|United Kingdom| GB|
|726367494002|    1852 Townsend st|     666|     Wallsend|  90382|          CA| United States| US|
|187644735867|Bärbel-HAMPDEN-Pi...|    null|        Miami|  13355|        null|       Kingdom| ZZ|
|948475348324|      155 sw City ct|Rochdale|      Germany|  30864|          FL|     Australia| QQ|
|164083193213|                abc"|    null|   Jerez Fra.|  11401|       Cadiz|         Spain| ES|
+------------+--------------------+--------+-------------+-------+------------+--------------+---+

Without the plugin I get back

+------------+--------------------+--------+-------------+-------+------------+--------------+---+
|         _c0|                 _c1|     _c2|          _c3|    _c4|         _c5|           _c6|_c7|
+------------+--------------------+--------+-------------+-------+------------+--------------+---+
|134324937434|    #1991 N Grayhawk|    null|   Menlo Park|  89025|          AB| United States| US|
|208564744937|      63,trevion Way|    null|   st Lothian| h7f4h8|        null|United Kingdom| GB|
|132709376823|  16 Oakland PARK RD|    null|         ring| l1w1e4|       South|        Canada| CA|
|224867848652|    7 kingwell Court|    null|       United|  s7jd9|South United|United Kingdom| GB|
|169636884295|     30 cartuja Road|    null|      Halifax|L0R 9p2|          ON|        Canada| CA|
|859473321609|              Street|    null|   Manchester|  92220|          OR| United States| US|
|141096112545|          99 rue des|    null|    Australia| jsd9je|        null|        France| FR|
|160397658930|              5 Rise|    null|walligshngton|RY6 8LT|        FORT|United Kingdom| GB|
|726367494002|    1852 Townsend st|     666|     Wallsend|  90382|          CA| United States| US|
|187644735867|Bärbel-HAMPDEN-Pi...|    null|        Miami|  13355|        null|       Kingdom| ZZ|
|948475348324|      155 sw City ct|Rochdale|      Germany|  30864|          FL|     Australia| QQ|
|164083193213|               abc""|    null|   Jerez Fra.|  11401|       Cadiz|         Spain| ES|
|198732413077|3p Grove Rochdale...|     BAW|      Fulifax|HX4 trW|        null|        Israel| GB|
|227433927227|       95 novem blvd|    null|  RAW VILLAGE|   3173|         XYZ|     Australia| IL|
+------------+--------------------+--------+-------------+-------+------------+--------------+---+

Which is also what I get back from pandas.

>>> import pandas as pd
>>> pd.read_csv("./test.csv", header=None)
               0                       1         2              3        4             5               6   7
0   134324937434        #1991 N Grayhawk       NaN     Menlo Park    89025            AB   United States  US
1   208564744937          63,trevion Way       NaN     st Lothian   h7f4h8           NaN  United Kingdom  GB
2   132709376823      16 Oakland PARK RD       NaN           ring   l1w1e4         South          Canada  CA
3   224867848652        7 kingwell Court       NaN         United    s7jd9  South United  United Kingdom  GB
4   169636884295         30 cartuja Road       NaN        Halifax  L0R 9p2            ON          Canada  CA
5   859473321609                  Street       NaN     Manchester    92220            OR   United States  US
6   141096112545              99 rue des       NaN      Australia   jsd9je           NaN          France  FR
7   160397658930                  5 Rise       NaN  walligshngton  RY6 8LT          FORT  United Kingdom  GB
8   726367494002        1852 Townsend st       666       Wallsend    90382            CA   United States  US
9   187644735867  Bärbel-HAMPDEN-Ping 37       NaN          Miami    13355           NaN         Kingdom  ZZ
10  948475348324          155 sw City ct  Rochdale        Germany    30864            FL       Australia  QQ
11  164083193213                   abc""       NaN     Jerez Fra.    11401         Cadiz           Spain  ES
12  198732413077  3p Grove Rochdale road       BAW        Fulifax  HX4 trW           NaN          Israel  GB
13  227433927227           95 novem blvd       NaN    RAW VILLAGE     3173           XYZ       Australia  IL

Expected behavior
CUDF returns the same result as Pandas and Spark.
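
The expectation above can be turned into a small check. This is only a sketch and not part of the original report: it uses Python's standard csv module as a stand-in reference parser (an assumption; the report itself compares against pandas and Spark), and the helper names reference_rows and matches_reference are hypothetical. The comparison helper has not been run against cudf here.

```python
import csv
from io import StringIO

# Last three rows of test.csv from the report; the first holds the abc"" field.
TAIL = (
    '164083193213,abc"","",Jerez Fra.,11401,Cadiz,Spain,ES\n'
    '198732413077,3p Grove Rochdale road,BAW,Fulifax,HX4 trW,"",Israel,GB\n'
    '227433927227,95 novem blvd,"",RAW VILLAGE,3173,XYZ,Australia,IL\n'
)


def reference_rows(text):
    """Parse CSV text with the stdlib reader (default dialect, quotechar '"')."""
    return list(csv.reader(StringIO(text)))


def matches_reference(read_csv, text):
    """Check a pandas-style reader (hypothetical usage: pd.read_csv or
    cudf.read_csv) for the two symptoms in the report: dropped rows and a
    truncated abc"" field in the first row, second column."""
    expected = reference_rows(text)
    frame = read_csv(StringIO(text), header=None)
    same_row_count = len(frame) == len(expected)
    same_field = frame.iloc[0, 1] == expected[0][1]
    return same_row_count and same_field


rows = reference_rows(TAIL)
print(len(rows), rows[0][1])
```

A call such as matches_reference(cudf.read_csv, TAIL) would then be expected to return True once the reader keeps all three rows and both quotes.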

revans2 commented 2 years ago

I was able to make the test case a lot simpler and still see the same error.

1
2
abc""
4
5

This shows the same problems: only the first three lines come out, and the third entry is only abc"; it is missing the final ".

revans2 commented 2 years ago

Wow, even if I escape the quotes I still get the problem with the dropped lines. Escaping "fixes" the issue of the trailing quote being removed, but the escapes are not removed from the quotes.

1
2
"abc\"\""
4
5

But the output is

+-------+
|    _c0|
+-------+
|      1|
|      2|
|abc\"\"|
+-------+

when it should be

+-----+
|  _c0|
+-----+
|    1|
|    2|
|abc""|
|    4|
|    5|
+-----+

Oddly, the behavior changes if I remove the escapes and just keep the entire thing quoted.

1
2
"abc"""
4
5

It fixes the problem with dropping lines, but it does not fix the single entry.

+----+
| _c0|
+----+
|   1|
|   2|
|abc"|
|   4|
|   5|
+----+

vs from spark

+-----+
|  _c0|
+-----+
|    1|
|    2|
|abc""|
|    4|
|    5|
+-----+

For this one I am less sure that we have to match exactly what Spark is doing, because pandas matches CUDF in this case. Pandas also does different things for escaped quotes, so just take these as info for now.
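
To make the three spellings of the field easier to compare, here is a minimal sketch using Python's standard csv module. The choice of parser is an assumption made for illustration; it is neither cudf, pandas, nor Spark, and it says nothing about how those readers are implemented. The helper name parse_field is hypothetical.

```python
import csv


def parse_field(line, **dialect_options):
    """Parse a single-column CSV line and return its only field."""
    return next(csv.reader([line], **dialect_options))[0]


# 1. Unquoted field: quotes in the middle of a field are kept literally.
unquoted = parse_field('abc""')

# 2. Fully quoted field: the doubled quote inside the quotes collapses to
#    one quote, and the final quote closes the field.
doubled = parse_field('"abc"""')

# 3. Backslash-escaped quotes inside a quoted field, read with the backslash
#    declared as the escape character.
escaped = parse_field('"abc\\"\\""', escapechar="\\")

print(unquoted, doubled, escaped)
```

Under these settings the fully quoted spelling yields a single trailing quote, which is the same result the thread reports for pandas and CUDF, while the other two spellings keep both quotes.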

GregoryKimball commented 2 years ago

Thanks @revans2 for investigating this. I'm posting a python reproducer:

>>> from io import StringIO
>>> import pandas as pd
>>> import cudf
>>> s = '1\n2\nabc""\n4\n5'
>>> pd.read_csv(StringIO(s), header=None)
       0
0      1
1      2
2  abc""
3      4
4      5
>>> cudf.read_csv(StringIO(s), header=None)
      0
0     1
1     2
2  abc"

vuule commented 2 years ago

I can't repro the issue in the comment above. I tried the following in Python: s = '1\n2\n"abc\"\""\n4\n' However, I get the same output as with Pandas (and it looks correct):

      1
0     2
1  abc"
2     4
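
One possible explanation, offered as an assumption rather than something confirmed in the thread: in a regular (non-raw) Python string literal, \" is an escape sequence for a plain double quote. The literal above therefore contains no backslashes and is the same text as the fully quoted "abc""" case. The sketch below checks this and shows a spelling that does keep the backslashes; the variable names are hypothetical.

```python
# The literal from the comment: each \" becomes a plain " in the string.
s_plain = '1\n2\n"abc\"\""\n4\n'

# The fully quoted, doubled-quote spelling.
s_doubled = '1\n2\n"abc"""\n4\n'

# Doubling the backslash keeps a real backslash in front of each inner quote.
s_escaped = '1\n2\n"abc\\"\\""\n4\n'

print(s_plain == s_doubled)      # the two literals are the same text
print(s_escaped.splitlines()[2])  # the third line as it would appear in a file
```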

vuule commented 1 year ago

I did some scoping, and it seems this requires large changes in the way the reader finds row offsets. The current state machine has four states (represented by two bits), and handling this would require an additional state, and thus more bits. My main concern is the work involved in changing the way the state machine packs and handles the states.
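
For readers unfamiliar with the approach, here is a minimal sketch of a row-offset finder driven by a state machine. It is not cudf's implementation: the thread does not list cudf's four states, so the states, the names State, bits_needed and row_offsets, and the sequential loop are all hypothetical. The sketch only illustrates why a quote has to be treated differently at the start of a field than in the middle of one, and why a fifth state no longer fits in two bits.

```python
from enum import IntEnum


class State(IntEnum):
    # Hypothetical states for illustration; not the ones cudf uses.
    FIELD_START = 0      # at the first character of a field
    UNQUOTED = 1         # inside a field that did not open with a quote
    QUOTED = 2           # inside a field that opened with a quote
    QUOTE_IN_QUOTED = 3  # just saw a quote inside a quoted field


def bits_needed(num_states):
    """Smallest number of bits that can encode num_states distinct values."""
    return max(1, (num_states - 1).bit_length())


def row_offsets(data, delimiter=",", quotechar='"', terminator="\n"):
    """Return the start offset of every row in data."""
    offsets = [0] if data else []
    state = State.FIELD_START
    for i, ch in enumerate(data):
        if state == State.QUOTED:
            # Delimiters and terminators are plain text inside quotes.
            if ch == quotechar:
                state = State.QUOTE_IN_QUOTED
            continue
        if state == State.QUOTE_IN_QUOTED and ch == quotechar:
            # Doubled quote: an escaped quote, still inside the quoted field.
            state = State.QUOTED
            continue
        if ch == terminator:
            state = State.FIELD_START
            if i + 1 < len(data):
                offsets.append(i + 1)
        elif ch == delimiter:
            state = State.FIELD_START
        elif ch == quotechar and state == State.FIELD_START:
            # Only a quote at the start of a field opens a quoted field.
            state = State.QUOTED
        else:
            # Any other character, including a quote in the middle of an
            # unquoted field such as abc"", is ordinary field content.
            state = State.UNQUOTED
    return offsets


print(row_offsets('1\n2\nabc""\n4\n5'))
print(bits_needed(4), bits_needed(5))
```

With these hypothetical states, the simplified input from earlier in the thread yields five row starts, and going from four states to five raises the packed width from two bits to three.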