mangiucugna / json_repair

A Python module to repair invalid JSON, commonly used to parse the output of LLMs
https://pypi.org/project/json-repair/
MIT License

[Bug]: #48

Closed lansespirit closed 4 months ago

lansespirit commented 4 months ago

Version of the library

0.20.1

Describe the bug

When some of the string elements in a list are missing their double quotes, the current repair logic fixes those elements as a single unit instead of repairing each one individually.

I haven't found a suitable solution for this either. I was thinking of treating a comma as the start of a new list element, but commas are ambiguous inside list elements: sometimes a comma is the element delimiter, and sometimes it is just punctuation within a sentence.

I wonder whether this could be repaired more precisely by taking more of the surrounding context into account.
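The comma ambiguity described above can be illustrated with a minimal sketch. The helper `naive_split` is hypothetical and is not part of json_repair; it simply splits a list body on every comma. It recovers the four names from the first list, but over-splits a sentence that contains commas as punctuation:

```python
def naive_split(body: str) -> list[str]:
    """Hypothetical helper: split a list body on every comma,
    then strip surrounding whitespace and stray double quotes."""
    return [part.strip().strip('"').strip() for part in body.split(",") if part.strip()]


# Body of the "people" list from the report (quotes missing on three names).
people = '"Rilee Smith", travel bloggers, Matthias Keller, Ben Harrell"'
print(naive_split(people))
# ['Rilee Smith', 'travel bloggers', 'Matthias Keller', 'Ben Harrell']

# A correctly quoted element whose commas are sentence punctuation.
sentence = ('"User satisfaction and feedback on AI travel planning tools '
            'like ChatGPT, Copilot, and Gemini."')
print(naive_split(sentence))
# ['User satisfaction and feedback on AI travel planning tools like ChatGPT',
#  'Copilot', 'and Gemini.']
```

A rule that helps the first case therefore damages the second, so a comma alone does not seem to be enough to decide where an element ends.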

How to reproduce

{ "people": ["Rilee Smith", travel bloggers, Matthias Keller, Ben Harrell"], "additional_research_needed": [ "Current AI trends in the travel industry for 2024.", "User satisfaction and feedback on AI travel planning tools like ChatGPT, Copilot, and Gemini.", "Latest advancements in the AI-driven content marketing landscape." ] }
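A minimal reproduction sketch for the input above. It first confirms that the standard-library parser rejects the string, then calls `repair_json` from json_repair if the package is installed. The variable names are illustrative, and no output is shown for the repair step because the result may differ between library versions:

```python
import json

# The malformed input from this report: three names in "people" lack quotes.
broken = (
    '{ "people": ["Rilee Smith", travel bloggers, Matthias Keller, Ben Harrell"], '
    '"additional_research_needed": [ '
    '"Current AI trends in the travel industry for 2024.", '
    '"User satisfaction and feedback on AI travel planning tools like ChatGPT, Copilot, and Gemini.", '
    '"Latest advancements in the AI-driven content marketing landscape." ] }'
)

# The strict parser fails on the first unquoted element.
try:
    json.loads(broken)
    stdlib_error = None
except json.JSONDecodeError as exc:
    stdlib_error = exc
    print(f"json.loads rejects the input: {exc}")

# Optional: run the repair if json_repair is available in the environment.
try:
    from json_repair import repair_json
except ImportError:
    repair_json = None

if repair_json is not None:
    print(repair_json(broken))
```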

Expected behavior

The repair should be more precise in using the surrounding context to determine the intended JSON structure.

mangiucugna commented 4 months ago

This will have to be closed as known behavior. The issue here is to strike a balance between ignoring garbage comments like { "key": "value", sure, here is another example "key2": "value2"} and legitimate use cases like yours. The current heuristic is already pretty lax: for example, if there had been any letter between the comma and the quote, it would have repaired it correctly. Any further adjustment would break the feature of ignoring weird comments that LLMs add in between elements...
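The trade-off can be seen in a toy sketch. This is not json_repair's actual heuristic; `quoted_only` is a hypothetical helper that keeps only properly quoted strings and discards everything else. That rule cleans up the chatter example, but it also throws away the unquoted names from the reported input:

```python
import re


def quoted_only(text: str) -> list[str]:
    """Hypothetical rule: keep only text enclosed in a pair of double quotes."""
    return re.findall(r'"([^"]*)"', text)


# LLM chatter between two key/value pairs is ignored, as desired.
chatter = '{ "key": "value", sure, here is another example "key2": "value2"}'
print(quoted_only(chatter))
# ['key', 'value', 'key2', 'value2']

# The same rule drops the three unquoted names from the reported list.
people = '"Rilee Smith", travel bloggers, Matthias Keller, Ben Harrell"'
print(quoted_only(people))
# ['Rilee Smith']
```

Loosening the rule to keep unquoted text would recover the names, but would also start keeping the chatter, which is the balance described above.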

lansespirit commented 4 months ago

Let's hope LLMs start producing more standardized JSON; there doesn't seem to be a better way at the moment.

In any case, thank you very much for this JSON repair module. It helps a lot in parsing the JSON output of LLMs correctly.