mangiucugna / json_repair

A python module to repair invalid JSON, commonly used to parse the output of LLMs
https://pypi.org/project/json-repair/
MIT License
826 stars 48 forks source link

[Bug]: unescaped quotes immediately followed by comma fail. #49

Closed aminland closed 4 months ago

aminland commented 4 months ago

Version of the library

0.20.1

Describe the bug

I see that in #46 you fixed the case where there was a comma after the unescaped quote, but with some text in between the comma and the unescaped quote

But sometimes we get results back from the LLM that look like the following: {"notes": "Sent a message to the "dictator", waiting on response."} which gets truncated: {"notes": "Sent a message to the \"dictator"}

How to reproduce

Add this line to test_object_edge_cases and run test again.

assert repair_json('{"key": "Lorem "ipsum", s"}') == '{"key": "Lorem \\"ipsum\\", s"}'

image

Expected behavior

I would expect that repair_json('{"key": "Lorem "ipsum", s"}') == '{"key": "Lorem \\"ipsum\\", s"}'

mangiucugna commented 4 months ago

Hi! I actually answered this same problem in #48 but the key issue is that this feature request collides with a different use case:

{ "key": "value", COMMENT FROM LLM "key2": "value2" }

and as far as I can tell, there is no way to distinguish the two use cases with a sure fire heuristics

aminland commented 4 months ago

I haven't looked too deeply at the existing heuristics, but since it's detecting that there is an escapable quote already, maybe if it encounters an escapable quote it can be greedy and look for a second? So essentially, preferring an even number of escaped quotes to odd numbers?

bwest2397 commented 4 months ago

@mangiucugna In this particular case, it seems that the other use case you presented,

{ "key": "value", COMMENT FROM LLM "key2": "value2" }

contains a second key, while this one

{"notes": "Sent a message to the "dictator", waiting on response."}

does not. I think a straightforward solution for the single-key case presented here might be just to look ahead for the existence of another colon after and, if it doesn't exist, then assume there should only be one key (or that this is the final key) and be greedy on the string. (I'll admit I haven't looked through the code (yet) to determine if that would compromise the other use case you presented, but at first glance I don't think it would)

mangiucugna commented 4 months ago

@aminland the code does that already, the fact is that sometimes you can have only a single escapable quote, the logic is already very complicated if you look at the supported use cases in the tests you will see the number of possible repairs around quotes. It seems LLMs really struggle with this :|

@bwest2397 it is in theory possible to support this specific use case only if it's the only key in the object, but I wonder if that would really suffice or this is just a simpler example

mangiucugna commented 4 months ago

An example of a single quote escape that isn't weird is 'isn't?' except sometimes llms can do much weirder stuff and use markdown or latex and it's hard to understand if a dangling character is valid or not...

mangiucugna commented 4 months ago

so I have a solution for this but it's really narrow:

this will be handled as detailed in this ticket: { "notes": "Sent a message to the "dictator", waiting on response."}

this will not: { "notes": "Sent a message to the "dictator", waiting "on" response."} { "notes": "Sent a message to the "dictator", waiting on response.", "key": "value"}

aminland commented 4 months ago

I wonder if that would really suffice or this is just a simpler example

This is definitely a simplified example :)

mangiucugna commented 4 months ago

Well for now I published 0.21.0 with this small change, at least a step forward