rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Find a way to support String column input/fixup for JSON parsing #15277

Open revans2 opened 7 months ago

revans2 commented 7 months ago

Is your feature request related to a problem? Please describe.

In Spark we have a requirement to pass in a column of strings and parse them as JSON. Ideally we would pass this directly to CUDF, but none of the input formats really support this, and neither do any of the pre-processing steps that the JSON reader has put in for us. What we do today is first check whether a line separator (carriage return) is in the data set. If there is one, we throw an exception. If not, we concatenate the rows into a single buffer with a line separator between the inputs. (We do some fixup for NULLs/empty rows too.)
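The workaround described above can be sketched in plain Python. This is only an illustration of the check-then-concatenate logic, not the actual Spark or CUDF code: the function name `concat_json_rows`, the default separator `"\n"`, and the choice of `{}` as the placeholder for NULL/empty rows are all assumptions made for the example.

```python
from typing import Optional, Sequence


def concat_json_rows(rows: Sequence[Optional[str]], sep: str = "\n") -> str:
    """Join a column of JSON strings into one buffer, one record per line.

    Hypothetical helper: raises if any row already contains the separator,
    mirroring the "throw an exception" behavior described in the issue.
    """
    fixed = []
    for i, row in enumerate(rows):
        if row is None or row == "":
            # Assumed fixup: keep the row count stable with an empty object.
            fixed.append("{}")
            continue
        if sep in row:
            # The separator inside a row would split it into two records.
            raise ValueError(f"row {i} contains the line separator")
        fixed.append(row)
    return sep.join(fixed)
```

Calling `concat_json_rows(['{"a":1}', None, '{"b":2}'])` yields a three-line buffer, while any row that contains the separator makes the whole call fail, which is the limitation this issue is about.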

The problem with this is that we throw an exception whenever the data contains a line separator, even though it is valid for Spark to have that character in the data.

I think that there are a few options that we have to fix this kind of a problem.

  1. Expose the API that removes unneeded whitespace. We could then remove the unneeded data from the buffer and replace any remaining line separators with '\n', because at that point they should only appear inside quoted strings. (We might need to do single quote normalization too, because I am not sure which one comes first.)
  2. Provide a way to set a different line separator (ideally something really unlikely to show up, such as NUL, '\0'). This would not fix the problem 100%, but it would make it super rare, and I would feel okay with a solution like this.
  3. Do nothing, and we just take the hit when we see a line with a line separator in it. We would then have to pull those lines back to the CPU, process them there, and push the results back to the GPU afterwards.
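To make option 2 concrete, here is a minimal CPU-only sketch that uses NUL as the record separator. It uses Python's standard `json` module as a stand-in for the GPU reader; the name `parse_with_separator` is hypothetical, and nothing here reflects an existing CUDF option.

```python
import json
from typing import List, Sequence

NUL = "\0"  # assumed separator, chosen because it rarely appears in text


def parse_with_separator(rows: Sequence[str], sep: str = NUL) -> List[object]:
    """Concatenate rows with `sep`, then split and parse each record."""
    if not rows:
        return []
    for i, row in enumerate(rows):
        if sep in row:
            # Still possible, but far less likely than hitting a newline.
            raise ValueError(f"row {i} contains the separator")
    buffer = sep.join(rows)
    # A real reader would split on `sep` while scanning the buffer.
    return [json.loads(record) for record in buffer.split(sep)]
```

With this separator, a row such as `'{"a":\n1}'` no longer triggers the exception, because the newline is treated as ordinary JSON whitespace rather than as a record boundary.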

I personally like option 2, but I am likely to implement option 3 in the short term unless I hear from CUDF that this is simple to do and can be done really quickly.
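For comparison, option 3 amounts to splitting the rows into a fast path and a fallback path. The sketch below is a plain-Python illustration under stated assumptions: `json.loads` stands in for both the batched GPU parse and the per-row CPU parse, and the helper names `split_for_fallback` and `parse_with_fallback` are hypothetical.

```python
import json
from typing import List, Sequence, Tuple


def split_for_fallback(rows: Sequence[str], sep: str = "\n") -> Tuple[List[int], List[int]]:
    """Return (fast_path_indices, fallback_indices) based on the separator."""
    fast, fallback = [], []
    for i, row in enumerate(rows):
        (fallback if sep in row else fast).append(i)
    return fast, fallback


def parse_with_fallback(rows: Sequence[str], sep: str = "\n") -> List[object]:
    fast, fallback = split_for_fallback(rows, sep)
    results: List[object] = [None] * len(rows)
    if fast:
        # Stand-in for the batched GPU parse of the concatenated buffer.
        buffer = sep.join(rows[i] for i in fast)
        for i, record in zip(fast, buffer.split(sep)):
            results[i] = json.loads(record)
    for i in fallback:
        # Stand-in for the per-row CPU parse of rows containing `sep`.
        results[i] = json.loads(rows[i])
    return results
```

The results are written back by original row index, so the output order matches the input column regardless of which path handled each row.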

GregoryKimball commented 1 month ago

@karthikeyann would you please link this issue to your (upcoming) histogram+concat PR in spark-rapids-jni?

karthikeyann commented 1 month ago

This is the PR https://github.com/NVIDIA/spark-rapids-jni/pull/2364