mangiucugna / json_repair

A python module to repair invalid JSON, commonly used to parse the output of LLMs
https://pypi.org/project/json-repair/
MIT License
826 stars 48 forks source link

repair_json incorrectly truncates JSON strings with escaped quotes and commas #44

Closed mlxyz closed 5 months ago

mlxyz commented 5 months ago

Describe the bug The valid JSON {"foo": "bar \"foo\", baz"} gets turned into the broken JSON {"foo": "bar \\"foo"} when using repair_json. I think this is related to the escaped quotes and comma.

To Reproduce Steps to reproduce the behavior:

  1. Call repair_json('{"foo": "bar \"foo\", baz"}')
  2. Check the output {"foo": "bar \\"foo"}

Expected behavior Correct output {"foo": "bar \"foo\", baz"}

mangiucugna commented 5 months ago

so this is a tough one because if you pass that string to python without using r"" the string passed is: {"foo": "bar "foo", baz"} and the result is correct because it's impossible to know in advance if that comma closes the key/value pair or is part of the string.

if you call repair_json(r'{"foo": "bar \"foo\", baz"}') the result is correctly {"foo": "bar \"foo\", baz"}

mlxyz commented 5 months ago

Ah, got it - thanks!

I thought my problem was related to escaped quotes but apparently it is not and now i can't seem to figure it out: I would expect {"foo": "bar \"e\", .", "g": ["h:"]]} to get repaired to {"foo": "bar \"e\", .", "g": ["h:"]} but repair_json(r'{"foo": "bar \"e\", .", "g": ["h:"]]}') returns {"foo": "bar \\\"e\\", ",": ": [\"h:"}.

Any idea why?

mangiucugna commented 5 months ago

I hate escaping problems, I will push a new version in a few minutes with what is hopefully a definitive fix