Closed aminland closed 4 months ago
Hi! I actually answered this same problem in #48 but the key issue is that this feature request collides with a different use case:
{ "key": "value", COMMENT FROM LLM "key2": "value2" }
and as far as I can tell, there is no way to distinguish the two use cases with a sure fire heuristics
I haven't looked too deeply at the existing heuristics, but since it's detecting that there is an escapable quote already, maybe if it encounters an escapable quote it can be greedy and look for a second? So essentially, preferring an even number of escaped quotes to odd numbers?
@mangiucugna In this particular case, it seems that the other use case you presented,
{ "key": "value", COMMENT FROM LLM "key2": "value2" }
contains a second key, while this one
{"notes": "Sent a message to the "dictator", waiting on response."}
does not. I think a straightforward solution for the single-key case presented here might be just to look ahead for the existence of another colon after and, if it doesn't exist, then assume there should only be one key (or that this is the final key) and be greedy on the string. (I'll admit I haven't looked through the code (yet) to determine if that would compromise the other use case you presented, but at first glance I don't think it would)
@aminland the code does that already, the fact is that sometimes you can have only a single escapable quote, the logic is already very complicated if you look at the supported use cases in the tests you will see the number of possible repairs around quotes. It seems LLMs really struggle with this :|
@bwest2397 it is in theory possible to support this specific use case only if it's the only key in the object, but I wonder if that would really suffice or this is just a simpler example
An example of a single quote escape that isn't weird is 'isn't?'
except sometimes llms can do much weirder stuff and use markdown or latex and it's hard to understand if a dangling character is valid or not...
so I have a solution for this but it's really narrow:
this will be handled as detailed in this ticket:
{ "notes": "Sent a message to the "dictator", waiting on response."}
this will not:
{ "notes": "Sent a message to the "dictator", waiting "on" response."}
{ "notes": "Sent a message to the "dictator", waiting on response.", "key": "value"}
I wonder if that would really suffice or this is just a simpler example
This is definitely a simplified example :)
Well for now I published 0.21.0 with this small change, at least a step forward
Version of the library
0.20.1
Describe the bug
I see that in #46 you fixed the case where there was a comma after the unescaped quote, but with some text in between the comma and the unescaped quote
But sometimes we get results back from the LLM that look like the following: {"notes": "Sent a message to the "dictator", waiting on response."} which gets truncated: {"notes": "Sent a message to the \"dictator"}
How to reproduce
Add this line to test_object_edge_cases and run test again.
assert repair_json('{"key": "Lorem "ipsum", s"}') == '{"key": "Lorem \\"ipsum\\", s"}'
Expected behavior
I would expect that
repair_json('{"key": "Lorem "ipsum", s"}') == '{"key": "Lorem \\"ipsum\\", s"}'