mangiucugna / json_repair

A python module to repair invalid JSON, commonly used to parse the output of LLMs
https://pypi.org/project/json-repair/
MIT License

Escaping underscores #19

Closed rbren closed 6 months ago

rbren commented 6 months ago

Describe the bug

We have an issue here: https://github.com/OpenDevin/OpenDevin/issues/495

The LLM response tries to escape underscores. So the key new_monologue becomes new\_monologue in the LLM response. json_repair double-escapes the backslash, instead of removing it.

This behavior, where the LLM attempts to escape underscores, seems fairly common. Maybe we could add a special case that replaces \_ with _?

Expected behavior

The stray escape characters are removed.

mangiucugna commented 6 months ago

First of all, json.loads() throws an error while trying to decode that JSON, because \_ is not one of the escape sequences the JSON standard allows. If the backslash is instead kept as a literal character in the key, it gets escaped again on output:

>>> json.dumps({"key\_1":"value"})
'{"key\\\\_1": "value"}'

And that is exactly the output you see here
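To make the behavior above concrete, here is a minimal sketch using only the standard library. The variable names are illustrative, and the snippet only mimics the effect of keeping the backslash as a literal character; it does not call json_repair itself.

```python
import json

# The raw LLM output: a backslash followed by an underscore inside a key.
raw = r'{"new\_monologue": "value"}'

# json.loads refuses to decode the sequence \_ and raises JSONDecodeError.
try:
    json.loads(raw)
    rejected = False
except json.JSONDecodeError:
    rejected = True

# If the backslash is kept as a literal character in the key,
# json.dumps must escape it, which doubles it in the serialized output.
key_with_backslash = "new\\_monologue"  # one backslash, then an underscore
dumped = json.dumps({key_with_backslash: "value"})

print(rejected)
print(dumped)
```

The second print shows the key with two backslashes, which is the double escaping reported in this issue.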

So I suppose you should probably replace this common occurrence yourself before it gets touched by any JSON library.
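A pre-processing step along those lines could look like the sketch below. The helper name `strip_escaped_underscores` is hypothetical, and the approach assumes the text never contains a legitimately escaped backslash followed by an underscore (`\\_`), which this simple replacement would mangle.

```python
import json


def strip_escaped_underscores(text: str) -> str:
    """Replace every backslash-underscore pair with a plain underscore."""
    return text.replace("\\_", "_")


# Example LLM response with an escaped underscore in the key.
raw = r'{"new\_monologue": "thinking..."}'

# Clean the text first, then hand it to the JSON parser.
parsed = json.loads(strip_escaped_underscores(raw))
print(parsed)
```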

mangiucugna commented 6 months ago

A possibility could be to remove all \ that are not used for escaping, but I need to think about whether it can be done safely
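One way to sketch that idea with a regular expression is shown below. This is an illustration under assumptions, not the code that json_repair ships: the helper `drop_stray_backslashes` is hypothetical, and it treats the whole input as string content rather than tracking where JSON strings begin and end.

```python
import json
import re

# First alternative: a valid JSON escape sequence, captured so it can be kept.
# Second alternative: any other backslash, which is treated as stray.
_ESCAPE_OR_STRAY = re.compile(r'(\\(?:["\\/bfnrt]|u[0-9a-fA-F]{4}))|\\')


def drop_stray_backslashes(text: str) -> str:
    """Keep valid escape sequences and delete every other backslash."""
    return _ESCAPE_OR_STRAY.sub(lambda m: m.group(1) or "", text)


raw = r'{"new\_monologue": "say \"hello world\""}'
cleaned = drop_stray_backslashes(raw)
parsed = json.loads(cleaned)
print(cleaned)
print(parsed)
```

Because the pattern consumes an escaped backslash (`\\`) as a single unit, the character that follows it is not mistaken for the target of an escape, and escaped quotes are left untouched.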

mangiucugna commented 6 months ago

I have released 0.11.0 with this fix, let me know if it fixes your issue

rbren commented 6 months ago

🎉 thanks!

I left a comment on the commit: do escaped quotes still work, like this?

{
  "message": "say \"hello world\""
}

mangiucugna commented 6 months ago

Saw the comment. The escape characters still work, but I found a tiny bug in which

{"key\_":"value"}

would break. That is now fixed, and I updated the unit tests to check for it.