rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] support '\n', '\r' and '\r\n' at the same time as line delimiters for CSV parsing #6572

Open revans2 opened 3 years ago

revans2 commented 3 years ago

Is your feature request related to a problem? Please describe.

By default, when reading CSV, Spark treats '\r' (Carriage Return), '\n' (Line Feed), and '\r\n' (Carriage Return followed by Line Feed) all as line delimiters.

Currently in the Spark plugin we pre-process the CSV input data before sending it to CUDF for parsing. The pre-processing handles splits to match what Spark currently does and also rewrites the line delimiters to a single uniform value. We have found that with fast storage, replacing the line delimiters is a real bottleneck.
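For illustration only, the delimiter-rewriting step described above can be sketched in Python roughly as follows. This is not the Spark plugin's actual code: the helper name `normalize_line_delimiters` is hypothetical, and the sketch ignores quoted fields, which a real pre-processor would have to respect. It does show why the step is costly: every byte of the input is scanned and copied before parsing can start.

```python
import re

# Hypothetical helper, not taken from the Spark plugin or cuDF.
# Matches CRLF first, then a lone CR, so a CRLF pair becomes one LF.
_CR_ENDINGS = re.compile(rb"\r\n|\r")


def normalize_line_delimiters(data: bytes) -> bytes:
    """Return a copy of data with CRLF and lone CR rewritten as LF.

    Assumption: no quoted fields contain embedded line breaks.
    """
    return _CR_ENDINGS.sub(b"\n", data)


raw = b"a,b\r\n1,2\r3,4\n"
normalized = normalize_line_delimiters(raw)
```

The result is a second full-size buffer, which is the extra pass over the data that the request is trying to avoid.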

We are also concerned about being ready to support GPU Direct Storage where we would not be able to pre-process the data before sending it to cudf.

Describe the solution you'd like

We would like an option when parsing CSV to have CUDF recognize '\r' (Carriage Return), '\n' (Line Feed), and '\r\n' (Carriage Return followed by Line Feed) all as valid line delimiters at the same time.
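As a rough sketch of the requested behavior (not how CUDF implements or would implement it), a parser could locate row boundaries in place instead of rewriting the buffer. The function name `row_offsets` is hypothetical, and the sketch again assumes no quoted fields with embedded line breaks.

```python
CR = 0x0D  # '\r'
LF = 0x0A  # '\n'


def row_offsets(data: bytes) -> list:
    """Return the byte offset at which each row starts.

    Treats CR, LF, and CRLF all as row delimiters; a CRLF pair
    counts as a single delimiter. The input buffer is not modified.
    """
    offsets = [0] if data else []
    i, n = 0, len(data)
    while i < n:
        byte = data[i]
        if byte == CR:
            # Skip the LF of a CRLF pair so it is not counted twice.
            i += 2 if i + 1 < n and data[i + 1] == LF else 1
        elif byte == LF:
            i += 1
        else:
            i += 1
            continue
        # A delimiter was consumed; a new row starts here unless
        # the delimiter was the last thing in the buffer.
        if i < n:
            offsets.append(i)
    return offsets


starts = row_offsets(b"a,b\r\n1,2\r3,4\n5,6")
```

For the mixed-delimiter input above, `starts` is `[0, 5, 9, 13]`: four rows found without copying or modifying the data.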

Describe alternatives you've considered

Keep doing what we are doing: be slower than ideal when parsing CSV, and be unable to support CSV without config modifications when we do adopt GPU Direct Storage.

kkraus14 commented 3 years ago

+1 to supporting this from the Python side as well

github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.