rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] JSON validator for json strings given in strings column #12532

Closed ttnghia closed 9 months ago

ttnghia commented 1 year ago

Currently, the input to the JSON reader is just one string. If we have the input JSON given by a strings column, we have to concatenate the rows of the input column into one unified JSON string before parsing it.

A problem emerges when some of the rows of the input strings column contain invalid JSON strings. In such situations, some applications such as Spark just output nulls for those invalid rows; the remaining rows are still parsed independently and an output table is still generated. On the other hand, cudf's JSON parser just throws an exception and the entire application crashes.

We could opt to build a JSON parser that works independently on each string row of the input strings column. However, according to @karthikeyann, this would take significant effort, so it is mentioned here but not asked for.

A simpler solution we should concentrate on is a JSON validator that can check whether each input string is valid JSON. The logic behind a validator should be easier to implement than a full parser, so it is doable. After quickly validating all input rows, we can identify which rows contain invalid JSON and simply trim them from the input.
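
For reference, here is a minimal sketch of the concatenate-then-parse flow described above, written against the cudf Python API (the Spark plugin goes through the Java/JNI bindings instead) with made-up row values. With one invalid row, the whole read throws instead of producing a null row:

import io
import cudf

rows = ['{"a": 1}', 'this is not valid json', '{"a": 3}']

# Concatenate the rows into one buffer before handing it to the JSON reader.
buffer = "\n".join(rows)

try:
    df = cudf.read_json(io.StringIO(buffer), lines=True)
    print(df)
except Exception as err:  # the exact exception type surfaced by libcudf may vary
    print("entire batch rejected:", err)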

ttnghia commented 1 year ago

CC @karthikeyann, @GregoryKimball and @revans2.

revans2 commented 1 year ago

We want this for JSON Lines files too. If a line is bad, it should not affect the other lines in the file. @ttnghia I think this might be related to how you are configuring the reader and concatenating the lines together. In the code you have, you are wrapping the entire file in "[" and "]" and inserting a "," between each line. If we just configure the reader as JSON Lines and insert a "\n" between each line, I think it will do exactly what we want.
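
For illustration, the two concatenation shapes being contrasted here look like this (Python pseudocode, made-up row values):

rows = ['{"A": 100}', '{"A": 200}']

# Wrapping the whole batch as a single JSON array, as described above:
as_array = "[" + ",".join(rows) + "]"   # '[{"A": 100},{"A": 200}]'

# Treating the batch as JSON Lines, one record per newline-separated line:
as_json_lines = "\n".join(rows)         # '{"A": 100}\n{"A": 200}'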

ttnghia commented 1 year ago

So far there is no customer screaming about the issue. Probably none of them have had to deal with invalid input JSON yet. However, if any of them stumble on it, they will turn to us.

Thus, supporting this FEA should be a high priority. I believe this is beneficial not just for us (Spark) but also for others who want to use the JSON reader widely.

revans2 commented 1 year ago

Yes, I just checked this myself and the experimental parser throws an exception if any of the lines are invalid. This is a major regression compared to the previous JSON lines parser and is going to be a blocker for us being able to adopt it.

elstehle commented 1 year ago

If it were sufficient to just concatenate the rows of the input column into a single JSON Lines string, we could provide a mechanism that recovers from an invalid line once a new line begins (i.e., after seeing a \n).

Currently, the JSON tokenizer component defines a state machine that has an error state which is entered on encountering invalid JSON input. The error state is a trap state: once entered, the machine remains in that state for the remaining input. For your use case, we could provide the possibility of returning (recovering) to the beginning-of-JSON-value (PD_BOV) state on encountering a \n, basically starting to parse the next JSON value. Apart from the state machine, this would require adapting the tree traversal component's handling of the ErrorBegin token, which @karthikeyann can speak to better.
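
Since the behaviour of the error state is the heart of the proposal, here is a rough host-side analogue (not libcudf's actual FST; json.loads stands in for the tokenizer) contrasting the current trap-state behaviour with the proposed recover-on-newline behaviour:

import json

def per_line_validity(buffer, recover_on_newline):
    # Returns one validity flag per newline-delimited line of the buffer.
    validity = []
    poisoned = False
    for line in buffer.split("\n"):
        if poisoned and not recover_on_newline:
            # Trap-state behaviour: after one bad line, everything that follows is bad.
            validity.append(False)
            continue
        try:
            json.loads(line)
            validity.append(True)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            validity.append(False)
            poisoned = True
    return validity

lines = '{"a":123}\n{"a\n{"a":456}'
print(per_line_validity(lines, recover_on_newline=True))   # [True, False, True]
print(per_line_validity(lines, recover_on_newline=False))  # [True, False, False]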

revans2 commented 1 year ago

@elstehle that sounds like a great solution when we are in JSON Lines parsing mode. If you are not in JSON Lines mode, then you don't have any kind of guarantee like that. But we plan on only using JSON Lines for our parsing, so you should check with others to see what kind of requirements they might have for containing errors in other modes.

karthikeyann commented 1 year ago

@elstehle For that idea, tree traversal does not need any update; tree generation will need an update. Besides, is there a guarantee that the JSON string will be in "records" (struct) or "values" (list) form uniformly for all rows? If that guarantee fails, tree traversal will fail.

@revans2 Actually, the above idea will work only if the input string in each row is not itself in JSON Lines format, because we are using the newline to differentiate between rows. If a JSON string row is itself JSON Lines, this may not work.

@revans2 Is it possible for the input JSON in each string to be in "JSON Lines" format? Is there an expected format for the JSON string in each row?

revans2 commented 1 year ago

To be clear, there are two ways that Spark parses JSON. The first is an input file where we are guaranteed that each record is separated by a line delimiter ('\n' is the default and is used everywhere, but it is configurable), aka JSON Lines. We also support reading JSON from a string column (as this issue is talking about). In that case there is no explicit guarantee that a '\n' will not show up in the body of the string. But Spark already has bugs if that happens, so for now we are just going to document that it is not supported, and in the worst case we can detect it ahead of time and throw an error.

For Spark, when parsing a file that is not in JSON Lines format, the data is a single JSON record per file. This is so slow that no one does it. When reading from a string column, each string entry is treated as an entire JSON record. So if it were in a format like [{"A": 100},\n{"A": 200}] then it would be an error unless the output type requested was a LIST<STRUCT> or something similar. We can cross that bridge when we have a customer that needs something like this.

GregoryKimball commented 1 year ago

Thank you everyone for this discussion.

It sounds like the issue should be re-scoped to "Return null rows when parsing JSON Lines and a line is invalid". Then we would not need a JSON validator tool, and Spark could configure the strings column as JSON Lines.

Aside from that, @karthikeyann brings up a good point that there are also schema issues for Spark compatibility. One schema issue would be some lines having an object root ({...) and some lines having an array root ([...) in the same column. Another would be a type mismatch for the same key (e.g. {"a":[]}\n{"a":{"b":0}}). If the current libcudf behavior is not going to work for Spark, we'll need to write issues for these cases as well.

ttnghia commented 1 year ago

It sounds like the issue should be re-scoped to "Return null rows when parsing JSON Lines and a line is invalid"

This is not 100% accurate. We also have APIs like from_json that don't parse JSON but just tokenize the input column. Given that use case (and potential future use cases), maybe the issue should be re-scoped to "Return null rows when tokenizing JSON Lines and a line is invalid".

elstehle commented 1 year ago

I've opened this draft PR (https://github.com/rapidsai/cudf/pull/13344) as a first step to address this issue.

I'm currently trying to pin down the exact behaviour for the new option that will recover from invalid JSON lines:

  1. How does Spark currently treat newlines within field names or string values for JSON Lines (multiline=false)? E.g. {"a\nb":"123\n456"}: are these treated as three invalid lines ({"a, b":"123, and 456"}) or as one valid line ({"a\nb":"123\n456"})?
  2. Is each empty line parsed to a null record (and what about consecutive empty lines and the last line)? E.g., will {"a":123}\n\n{"a":456} parse to three records: {"a":123}, null, {"a":456}? And what about the last JSON line if it is followed by a newline? E.g., will {"a":123}\n{"a":456}\n parse to three records: {"a":123}, {"a":456}, null?
  3. Do we need to differentiate between an invalid JSON line and an empty (null) JSON line for the parsed data? If yes, how would you like to get the invalid lines / rows?

elstehle commented 1 year ago

I ran Spark's JSON reader on the following input:


{"a":123}

{"a
":456}

{"a":123}

Which reads:

+---------------+----+
|_corrupt_record|   a|
+---------------+----+
|           null| 123|
|            {"a|null|
|         ":456}|null|
|           null| 123|
+---------------+----+

So it seems that:

  1. newlines within field names are treated as line delimiters
  2. empty lines are ignored

revans2 commented 1 year ago

So to be clear, there are multiple different modes for parsing JSON. By default, when you read JSON from a file, all newlines are line delimiters. This is mostly because the file can be split at any point, and they decided that splitting files is more important than supporting newlines in the middle of a record.

But if you use from_json to parse the data from a string column, it reads it properly. Unfortunately?? I guess. Spark does not really have a way to have white space characters in the names of columns or by default unescaped new lines in string data. So parsing it to get what you want gets a little complicated, which is partly why it is going to be difficult to get everything to match perfectly. I am doubtful without a specific Spark mode or a lot of knobs to turn that we are going to get this to match perfectly.

scala> val df = Seq("""{"a":123}""","{\"a\n\":456}", """{"a":123}""", "{\"a\"\n:456}").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.selectExpr("from_json(value, \"STRUCT<a: STRING>\")", "from_json(value, \"MAP<STRING, STRING>\")", "value").show()

+----------------+----------+-----------+
|from_json(value)|   entries|      value|
+----------------+----------+-----------+
|           {123}|{a -> 123}|  {"a":123}|
|          {null}|      null|{"a\n":456}|
|           {123}|{a -> 123}|  {"a":123}|
|           {456}|{a -> 456}|{"a"\n:456}|
+----------------+----------+-----------+

elstehle commented 1 year ago

Thanks a lot, @revans2. I think I do have a better understanding of from_json's behaviour now.

I am doubtful without a specific Spark mode or a lot of knobs to turn that we are going to get this to match perfectly.

Spark has a very intricate behaviour, especially around corner cases. From your examples, I understand that from_json behaves as if we were reading each row with multiline=True (I'm extrapolating from the result of the fourth row), where newlines are treated as whitespace. This raises a conflict with our original idea to concatenate rows with \n and recover from an error after a newline.
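
A tiny illustration of that conflict, reusing the fourth row from the Spark session above: it is a single valid record for from_json (the newline is just whitespace), but a \n-joined concatenation with newline-based recovery would split it into two fragments, neither of which is valid JSON on its own:

row = '{"a"\n:456}'

print(row.split("\n"))   # ['{"a"', ':456}'] -> two lines, neither parses as JSON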

I'm thinking of two alternatives that could work around this issue:

  1. Instead of concatenating the rows with \n, we use another row delimiter, like 0x1E (record separator), to concatenate the rows, and recover from an error after such a delimiter (0x1E). Newlines are treated as whitespace. If such a delimiter could be part of a row's string, Spark would need to pre-process the strings to escape the delimiter.
  2. Spark pre-processes the strings, removing whitespace newlines. This would require distinguishing between newlines enclosed in quotes (strings / field names) and newlines that are just part of whitespace.

I think (1) may be easier, because (2) isn't trivial to pre-process; a rough sketch of (1) follows below.
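
The sketch assumes (hypothetically) that the reader could be told to recover at a 0x1E record separator instead of \n; only the host-side preparation is shown:

RS = "\x1e"  # ASCII record separator, assumed (hypothetically) to be the recovery delimiter

rows = ['{"a"\n:456}', '{"b": 1}']

# Escape any record separator that happens to appear inside a row's string data,
# so row content can never be confused with the delimiter.
safe_rows = [r.replace(RS, "\\u001e") for r in rows]

# Join on the record separator; embedded newlines now stay inside their row and
# would be treated as ordinary JSON whitespace by the reader.
buffer = RS.join(safe_rows)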

Then, we still have this behaviour:

I guess. Spark does not really have a way to have white space characters in the names of columns or by default unescaped new lines in string data.

This is something we don't "fail" for right now. How important would this be for Spark? I'll put some thought to it in the meanwhile.

revans2 commented 1 year ago

Spark has a very intricate behaviour, especially around corner cases. From your examples, I understand that from_json behaves as if we were reading each row with multiline=True (I'm extrapolating from the result of the fourth row), where newlines are treated as whitespace. This raises a conflict with our original idea to concatenate rows with \n and recover from an error after a newline.

I agree that we might end up needing a Spark-specific mode, which is not really what I think anyone wants. I am hopeful that we can find a small set of control flags, also common with pandas, that allow us to get really close to what Spark does. Then we can do some extra validation checks/cleanup that is Spark-specific. Sadly, it looks like pandas does not have that many config options for JSON parsing, so it probably would involve more pre-processing and cleanup.

The newlines are the hardest part, so we might need to do some special things for them. Even then, I don't think it is as bad as we think. By default in Spark, a newline is not allowed inside a value or key.

We can fall back to the CPU if that option is set to true. In the common case we can write a custom kernel, or use a regular expression, to scan for the bad case and know which rows have the error in them. We can then replace/normalize the whitespace in the string before doing the concat.

I think the kernel should not be too bad. The hardest part would be making it fast for long strings, but even then we can accept less-than-optimal performance for highly escaped quotes (\\\\\\\\\") so long as the common case is still fast.
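
As a rough host-side sketch of what such a check could look like (illustration only, not the proposed kernel): flag rows whose string data contains a newline inside quotes, tracking backslash escapes so that escaped quotes don't flip the in-string state:

def has_newline_inside_quotes(row):
    in_string = False
    escaped = False
    for ch in row:
        if escaped:
            escaped = False        # this character is escaped; it cannot open/close a string
        elif ch == "\\":
            escaped = True
        elif ch == '"':
            in_string = not in_string
        elif ch == "\n" and in_string:
            return True
    return False

rows = ['{"a":123}', '{"a\n":456}', '{"a"\n:456}']
print([has_newline_inside_quotes(r) for r in rows])   # [False, True, False]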

I guess. Spark does not really have a way to have white space characters in the names of columns or by default unescaped new lines in string data.

This is something we don't "fail" for right now. How important would this be for Spark? I'll put some thought to it in the meanwhile.

Like I said, I think for each difference with Spark we need to decide whether it should be handled in common code (pandas and/or others are likely to have the same problem) or done by the Spark team as pre- and/or post-processing.

https://spark.apache.org/docs/latest/sql-data-sources-json.html are the configs that Spark supports right now.

For all of the configs around parsing primitive data types, we would just ask cuDF to return the values as strings and we would handle the parsing/validation with custom kernels. These include timeZone, allowNumericLeadingZeros, dateFormat, timestampFormat, timestampNTZFormat, enableDateTimeParsingFallback, and allowNonNumericNumbers.

For many others we can fall back to the CPU if we see them, because the default values are things that I think cuDF can support out of the box (but I need to validate that): allowComments, allowUnquotedFieldNames, multiLine, encoding, lineSep, mode, and columnNameOfCorruptRecord.

Some we can just ignore because they deal with inferring the data types, and we don't support that on the GPU right now: primitivesAsString, samplingRatio, dropFieldIfAllNull, and prefersDecimal.

Others I think we can support with a combination of pre/post-processing on the input itself: allowUnquotedControlCharacters and allowBackslashEscapingAnyCharacter. These both really come down to a kernel that can detect the bad cases in the input rows, which mostly involves tracking quotes and escapes, and possibly a way to fix up the input so cuDF is happy.

And finally, there are some that I think we will need help from cuDF on: allowSingleQuotes.

The other places where we need help are recovering from badly formatted JSON and, for some really odd corner cases where we want to be 100% identical to Spark, a way to know whether the input was quoted or not when it is returned to us as a string. The second part is because if Spark sees a value of 12.3450 with no quotes, but the user asks for a String type, it will first parse the data as a floating-point value and then convert that back to a string. That I think is super minor and is something that I am happy to live with as an incompatibility for now.
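
The 12.3450 corner case in miniature (plain Python, just to show where the trailing zero goes):

token = "12.3450"      # unquoted in the JSON input, but the user asked for a String
value = float(token)   # Spark first parses the unquoted token as a number...
print(str(value))      # ...and rendering it back yields "12.345", not "12.3450"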

GregoryKimball commented 9 months ago

I would like to close this issue in favor of #13525