mangiucugna / json_repair

A python module to repair invalid JSON, commonly used to parse the output of LLMs
https://pypi.org/project/json-repair/
MIT License

[Bug]: Strings containing unescaped quotes followed by commas are incorrectly truncated #46

Closed bwest2397 closed 5 months ago

bwest2397 commented 5 months ago

Version of the library

0.19.2

Describe the bug

When a string contains an unescaped quoted phrase that is followed at a later point by a comma, the string is truncated at the closing " of that quoted phrase. If this string is the last value in the JSON object and is not immediately followed by } (i.e. it is followed by whitespace or, e.g., a comma), then the final word of the string is also parsed as a key with an empty string value.

This seems to relate to https://github.com/mangiucugna/json_repair/issues/44, but it seems the attempted fix for that bug report didn't fully resolve this.

How to reproduce

(Note, I've formatted the recovered/output JSON just to make it more readable)

For

>>> repair_json('{"lorem": "Lorem "ipsum" excepteur sint, suntid est laborum"}')

the recovered JSON is:

{
  "lorem": "Lorem \"ipsum"
}

For any of the following examples

>>> repair_json('{"lorem": "Lorem "ipsum" excepteur sint, suntid est laborum" }')
>>> repair_json('{"lorem": "Lorem "ipsum" excepteur sint, suntid est laborum"\n}')
>>> repair_json('{"lorem": "Lorem "ipsum" excepteur sint, suntid est laborum",}')

the recovered JSON is:

{
  "lorem": "Lorem \"ipsum",
  "laborum": ""
}

Removing the comma, the output matches what we'd expect:

>>> repair_json('{"lorem": "Lorem "ipsum" excepteur sint suntid est laborum"}')
>>> repair_json('{"lorem": "Lorem "ipsum" excepteur sint suntid est laborum" }')

yields

{
  "lorem": "Lorem \"ipsum\" excepteur sint suntid est laborum"
}

Expected behavior

>>> print(repair_json('{"lorem": "Lorem "ipsum" excepteur sint, suntid est laborum"}'))
{"lorem": "Lorem \"ipsum\" excepteur sint, suntid est laborum"}

>>> print(repair_json('{"lorem": "Lorem "ipsum" excepteur sint, suntid est laborum" }'))
{"lorem": "Lorem \"ipsum\" excepteur sint, suntid est laborum"}
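
For reference, the expected output is simply the standard JSON encoding of the intended value. The sketch below uses only Python's built-in json module to derive it and to check a candidate repair. The names INTENDED and is_expected are hypothetical helpers for illustration, not part of json_repair.

```python
import json

# The value the reporter intended to encode (inner quotes are literal characters).
INTENDED = {"lorem": 'Lorem "ipsum" excepteur sint, suntid est laborum'}

# json.dumps escapes the inner quotes, giving the output the issue expects.
EXPECTED_JSON = json.dumps(INTENDED)


def is_expected(repaired: str) -> bool:
    """Return True if a repaired JSON string decodes to the intended value."""
    try:
        return json.loads(repaired) == INTENDED
    except json.JSONDecodeError:
        return False


if __name__ == "__main__":
    print(EXPECTED_JSON)
    # The truncated output reported in this issue does not pass the check.
    print(is_expected('{"lorem": "Lorem \\"ipsum"}'))
```

A check like is_expected could be applied to the string returned by repair_json for each of the inputs above.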

mangiucugna commented 5 months ago

This was tough because the library is actually behaving as expected. I found a workaround that I am releasing now, but it is an unstable equilibrium when it comes to wrong delimiters, because there are a million corner cases that can go wrong. Nonetheless, the solution I found seems to work and passes all tests.
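
To illustrate why wrong delimiters are hard to get right, here is a minimal sketch of one naive lookahead heuristic. It is an assumption for illustration only, not the library's actual implementation: a quote inside a string is treated as the closing delimiter only if the next non-whitespace character is structural (one of , } ] :) or the input ends. The name escape_inner_quotes is hypothetical.

```python
import json

# Characters that may legitimately follow a closing quote in JSON.
STRUCTURAL = set(",}]:")


def escape_inner_quotes(text: str) -> str:
    """Escape quotes inside strings unless a structural character follows."""
    out = []
    in_string = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
        elif ch == "\\" and i + 1 < n:
            # Copy an existing escape sequence untouched.
            out.append(text[i:i + 2])
            i += 1
        elif ch == '"':
            # Look past whitespace to the next significant character.
            j = i + 1
            while j < n and text[j].isspace():
                j += 1
            if j == n or text[j] in STRUCTURAL:
                in_string = False  # treat as the real closing quote
                out.append(ch)
            else:
                out.append('\\"')  # treat as a literal quote in the text
        else:
            out.append(ch)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    broken = '{"lorem": "Lorem "ipsum" excepteur sint, suntid est laborum"}'
    print(json.loads(escape_inner_quotes(broken)))
```

This handles the first example from the report, but it breaks as soon as the inner quote is itself directly followed by a comma, as in '{"k": "say "hi", then leave"}', where the heuristic closes the string too early. It also does nothing about the trailing comma in the third variant. Corner cases like these are what make any such rule fragile.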

bwest2397 commented 5 months ago

Awesome, thanks! I tested the new release with some samples I had and they seem to work 👍