josdejong / jsonrepair

Repair invalid JSON documents
https://josdejong.github.io/jsonrepair/
Other
479 stars 32 forks source link

Failing to fix a json string containing parenthesis `(` or `)` #126

Open tybalex opened 4 months ago

tybalex commented 4 months ago

The Problem When I tried to use the library to fix the following string, it failed: Input:

{"name": "run", "args": {"param1": ["This is C(2)", "This is F(3)]}}

Output:

Error: Object key expected at position 62

======================== However without the '(' or ')' char, it can produce a correct fix: Input:

{"name": "run", "args": {"param1": ["This is C(2)", "This is F3]}}

Output:

{"name": "run", "args": {"param1": ["This is C(2)", "This is F3"]}}

Is this behavior expected or it is a bug?

tybalex commented 4 months ago

Another example: Input:

{"name": "run", "args": {"param1": ["This is C(2)", This is F(3)]}}

Output:

{"name": "run", "args": {"param1": ["This is C(2)", 3]}}
josdejong commented 4 months ago

Thanks for your input.

Just curious: did you encounter this broken JSON in a real world example, or did you make it up?

The limitation originates in the code that identifies the end of the string when the end quote is missing. It currently stops at the first next delimiter, including ( and ). That is needed to identify MongoDB data types and JSONP notation. To prevent for example This is F(3) to be identified as a MongoDB/JSONP function and replaced with 3, we should refine the logic, for example by checking abcense of spaces in the name, and/or checking against a list with known MongoDB data types.

josdejong commented 4 months ago

Via 58fe64ce62d7a3840f427df2dabb1fa748e540de I've made the detection of MongoDB/JSONP function calls more robust. This solves the issue of jsonrepair silently changing an unquoted string containing parenthesis, the library now consistenly throws an exception.

The fix is not yet published.

Repairing an unquoted string containing parenthesis would be a next step.

tybalex commented 4 months ago

Thanks for your input.

Just curious: did you encounter this broken JSON in a real world example, or did you make it up?

The limitation originates in the code that identifies the end of the string when the end quote is missing. It currently stops at the first next delimiter, including ( and ). That is needed to identify MongoDB data types and JSONP notation. To prevent for example This is F(3) to be identified as a MongoDB/JSONP function and replaced with 3, we should refine the logic, for example by checking abcense of spaces in the name, and/or checking against a list with known MongoDB data types.

Hi @josdejong , thank you for the quick response! I really like this tool.

To answer your question: This is a real world example -- it is produced by an LLM(large language model) we trained, and I was trying to use jsonrepair to fix some of the broken json produced by the LLM.

josdejong commented 4 months ago

Thanks, good to know.