mangiucugna / json_repair

A Python module to repair invalid JSON, commonly used to parse the output of LLMs
https://pypi.org/project/json-repair/
MIT License

[Bug]: Failed repair on some quote cases #56

Closed Robin-Dong closed 3 months ago

Robin-Dong commented 3 months ago

Version of the library

0.25.2

Describe the bug

In the cases below, the inputs with IDs 1, 4, and 5 are repaired incorrectly; IDs 2 and 3 are repaired as expected.

input: {"na"me": "Jack O"Sullivan", "id": "1"}
output: {"na": "e", "Jack O": "ullivan", "id": "1"}
------------
input: {"name": "Jack: The "OG" O"Sullivan"", "id": "2"}
output: {"name": "Jack: The \"OG\" O\"Sullivan\"", "id": "2"}
------------
input: {"name": "Jack: The "OG"", "surname": 'O'Sullivan', "id": "3"}
output: {"name": "Jack: The \"OG\"", "surname": "O'Sullivan", "id": "3"}
------------
input: {"test_str": {"1singlechar": "a""a""a", "2singlechars": "a"a"a"a"a"a"a"a"a"}, "id": "4"}
output: {"test_str": {"1singlechar": "a\"", "a": "a", "2singlechars": "a\"a\"a\"a\"a\"a\"a\"a\"a"}, "id": "4"}
------------
input: {'name': 'Jack O'Sullivan, 'id': '5'}
output: {"name": "Jack O", "id": "5"}
------------

How to reproduce

from json_repair import repair_json

req_jsons = [
    '{"na"me": "Jack O"Sullivan", "id": "1"}',
    '{"name": "Jack: The "OG" O"Sullivan"", "id": "2"}',
    '{"name": "Jack: The "OG"", "surname": \'O\'Sullivan\', "id": "3"}',
    '{"test_str": {"1singlechar": "a""a""a", "2singlechars": "a"a"a"a"a"a"a"a"a"}, "id": "4"}',
    "{'name': 'Jack O'Sullivan, 'id': '5'}",
]

for bad_json_string in req_jsons:
    good_json_string = repair_json(bad_json_string, skip_json_loads=True)
    print(f"input: {bad_json_string}\noutput: {good_json_string}")
    print("------------")

Expected behavior

input: {"na"me": "Jack O"Sullivan", "id": "1"}
output: {"na\"me": "Jack O\"Sullivan", "id": "1"}
------------
input: {"name": "Jack: The "OG" O"Sullivan"", "id": "2"}
output: {"name": "Jack: The \"OG\" O\"Sullivan\"", "id": "2"}
------------
input: {"name": "Jack: The "OG"", "surname": 'O'Sullivan', "id": "3"}
output: {"name": "Jack: The \"OG\"", "surname": "O'Sullivan", "id": "3"}
------------
input: {"test_str": {"1singlechar": "a""a""a", "2singlechars": "a"a"a"a"a"a"a"a"a"}, "id": "4"}
output: {"test_str": {"1singlechar": "a\"\"a\"\"a", "2singlechars": "a\"a\"a\"a\"a\"a\"a\"a\"a"}, "id": "4"}
------------
input: {'name': 'Jack O'Sullivan, 'id': '5'}
output: {"name": "Jack O'Sullivan", "id": "5"}
------------

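To make the mismatch easy to check, here is a minimal sketch that compares the actual and expected outputs listed in this report using only the standard `json` module. The strings are copied from the sections above; for case 1 the expected key is taken to be `na"me` with the inner quote escaped. The names `pairs` and `failing_ids` are illustrative, not part of json_repair.

```python
import json

# (actual, expected) output pairs copied from this report, keyed by "id".
# Raw strings keep the JSON backslash escapes intact.
pairs = {
    "1": (r'''{"na": "e", "Jack O": "ullivan", "id": "1"}''',
          r'''{"na\"me": "Jack O\"Sullivan", "id": "1"}'''),
    "2": (r'''{"name": "Jack: The \"OG\" O\"Sullivan\"", "id": "2"}''',
          r'''{"name": "Jack: The \"OG\" O\"Sullivan\"", "id": "2"}'''),
    "3": (r'''{"name": "Jack: The \"OG\"", "surname": "O'Sullivan", "id": "3"}''',
          r'''{"name": "Jack: The \"OG\"", "surname": "O'Sullivan", "id": "3"}'''),
    "4": (r'''{"test_str": {"1singlechar": "a\"", "a": "a", "2singlechars": "a\"a\"a\"a\"a\"a\"a\"a\"a"}, "id": "4"}''',
          r'''{"test_str": {"1singlechar": "a\"\"a\"\"a", "2singlechars": "a\"a\"a\"a\"a\"a\"a\"a\"a"}, "id": "4"}'''),
    "5": (r'''{"name": "Jack O", "id": "5"}''',
          r'''{"name": "Jack O'Sullivan", "id": "5"}'''),
}


def failing_ids(pairs):
    """Return the ids whose actual output parses to a different value than expected."""
    return [
        case_id
        for case_id, (actual, expected) in pairs.items()
        if json.loads(actual) != json.loads(expected)
    ]


print(failing_ids(pairs))  # ['1', '4', '5']
```

Comparing parsed values rather than raw strings avoids false mismatches caused by whitespace or key formatting.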
mangiucugna commented 3 months ago

Hi, those are all tricky cases that clash with other requirements (most notably the need to remove stray LLM comments from objects). Which LLM generated those?
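To illustrate why these cases are ambiguous, here is a rough sketch of one possible lookahead heuristic: a double quote inside a string is treated as closing only when the next non-space character is `,`, `:`, `}`, `]`, or the end of input. This is not how json_repair works internally; `is_closing_quote` and `escape_inner_quotes` are hypothetical helpers, and the sketch ignores single-quoted strings (cases 3 and 5).

```python
import json

# Characters that may legitimately follow a closing quote in JSON.
STRUCTURAL = set(",:}]")


def is_closing_quote(text: str, i: int) -> bool:
    """Guess whether the double quote at text[i] ends a string."""
    j = i + 1
    while j < len(text) and text[j].isspace():
        j += 1
    return j == len(text) or text[j] in STRUCTURAL


def escape_inner_quotes(text: str) -> str:
    """Escape double quotes that the lookahead does not consider closing."""
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
        elif ch == "\\" and i + 1 < len(text):
            # Copy an existing escape sequence through untouched.
            out.append(text[i:i + 2])
            i += 1
        elif ch == '"':
            if is_closing_quote(text, i):
                in_string = False
                out.append(ch)
            else:
                out.append('\\"')
        else:
            out.append(ch)
        i += 1
    return "".join(out)


fixed = escape_inner_quotes('{"na"me": "Jack O"Sullivan", "id": "1"}')
print(fixed)              # {"na\"me": "Jack O\"Sullivan", "id": "1"}
print(json.loads(fixed))  # {'na"me': 'Jack O"Sullivan', 'id': '1'}
```

The same lookahead shows the clash: for an input such as `{"a": "b" some stray comment, "c": "d"}`, it absorbs the stray comment into the string value instead of dropping it, and the result is no longer valid JSON.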