mangiucugna / json_repair

A python module to repair invalid JSON, commonly used to parse the output of LLMs
https://pypi.org/project/json-repair/
MIT License

Adding missing escape for double quote #26

Closed nikolaysm closed 6 months ago

nikolaysm commented 6 months ago

Hi @mangiucugna,

Thank you for your efforts on this. I've encountered a similar issue with the output from the LLM. It seems that the repair_json function isn't handling certain cases correctly.

For instance, when trying to repair the following JSON string:

from json_repair import repair_json

json_str = '{\n"html": "<h3 id="title">Waarom meer dan 200 Technical Experts - "Passie voor techniek"?</h3>"}'
data = repair_json(json_str, return_objects=True)

The current output is:

{
    'html': '<h3 id=', 
    'techniek': 'h3>',
    'title': u'Waarom meer dan 200 Technical Experts - '
}

However, the expected output should be:

{
    'html': '<h3 id="title">Waarom meer dan 200 Technical Experts - "Passie voor techniek"?</h3>'
}
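For reference, the expected output corresponds to JSON in which the inner double quotes are escaped as \". The following is a minimal sketch using only the standard library json module (not json_repair) to show why the raw string fails and what the correctly escaped form looks like. The variable names are illustrative only.

```python
import json

# The LLM output from the report: the inner double quotes are not escaped.
broken = '{\n"html": "<h3 id="title">Waarom meer dan 200 Technical Experts - "Passie voor techniek"?</h3>"}'

try:
    json.loads(broken)
    parsed_ok = True
except json.JSONDecodeError:
    # A strict parser treats the quote right after `id=` as the end of the string.
    parsed_ok = False

expected = {
    "html": '<h3 id="title">Waarom meer dan 200 Technical Experts - "Passie voor techniek"?</h3>'
}

# json.dumps escapes each inner double quote as \" so the text is valid JSON.
valid_json = json.dumps(expected)
print(valid_json)

# Parsing the correctly escaped text gives back the original value.
round_tripped = json.loads(valid_json)
```

So the repair being asked for amounts to deciding which quotes are delimiters and which belong to the value, then escaping the latter.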

It seems like the function is having trouble handling certain characters or nested structures properly. Would you mind looking into this further?

Thank you again for your attention to this matter.

_Originally posted by @nikolaysm in https://github.com/mangiucugna/json_repair/issues/20#issuecomment-2066249721_

nikolaysm commented 6 months ago

Hi @nikolaysm, can you open a new issue for that? The problem is that the correct format is <h3 id='title'>, so I am not 100% sure I can support this use case, but it is definitely a distinct use case from this issue.

_Originally posted by @mangiucugna in https://github.com/mangiucugna/json_repair/issues/20#issuecomment-2066367077_

After removing the attribute id="title", I still experience the same issue. Any suggestions on how to fix the issue with "Passie voor techniek" within the value?

json_str = '{\n"html": "<h3 >Waarom meer dan 200 Technical Experts - "Passie voor techniek"?</h3>"}'
data = repair_json(json_str, return_objects=True)

mangiucugna commented 6 months ago

So this is an entirely different problem. My idea would be to do something dirty here and use regex (I know, I know) to treat anything inside HTML tags as one string, and then replace the offending quote characters.
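A rough sketch of what that regex idea could look like, purely as an illustration. This is not how json_repair is implemented; the helper name escape_quotes_in_html_values and the pattern are assumptions made up for this example, and it only uses the standard library.

```python
import json
import re

# Hypothetical pattern: a string value that starts with "<" and ends with ">".
# The match is greedy, so it runs from the first `"<` to the last `>"` in the text.
HTML_VALUE = re.compile(r'"(<.*>)"', re.DOTALL)


def escape_quotes_in_html_values(text: str) -> str:
    """Treat everything between `"<` and `>"` as one string and escape its quotes."""

    def _fix(match: re.Match) -> str:
        inner = match.group(1)
        # Escape only the double quotes that are not already preceded by a backslash.
        inner = re.sub(r'(?<!\\)"', r'\\"', inner)
        return '"' + inner + '"'

    return HTML_VALUE.sub(_fix, text)


json_str = '{\n"html": "<h3 id="title">Waarom meer dan 200 Technical Experts - "Passie voor techniek"?</h3>"}'
fixed = escape_quotes_in_html_values(json_str)
data = json.loads(fixed)
```

Because the match is greedy, this sketch breaks as soon as a document holds more than one HTML value (the delimiters between the two values get escaped as well), which is part of why the approach counts as "dirty".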

nikolaysm commented 6 months ago

We might consider expanding the logic to detect the closing quote.

A similar approach is discussed here: https://github.com/josdejong/jsonrepair/pull/116
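To make the closing-quote idea concrete, here is a minimal sketch under one assumed rule: a double quote inside a string ends it only when the next non-whitespace character is a comma, closing brace, closing bracket, colon, or the end of the input. Any other quote is treated as part of the value and escaped. The function name escape_inner_quotes is hypothetical, and this is not the actual logic of json_repair or jsonrepair.

```python
import json

# Characters that may legitimately follow a closing quote in JSON.
STRUCTURAL = set(",}]:")


def escape_inner_quotes(text: str) -> str:
    out = []
    in_string = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
        elif ch == "\\" and i + 1 < n:
            # Keep existing escape sequences untouched.
            out.append(ch)
            out.append(text[i + 1])
            i += 1
        elif ch == '"':
            # Look ahead to the next non-whitespace character.
            j = i + 1
            while j < n and text[j].isspace():
                j += 1
            if j == n or text[j] in STRUCTURAL:
                in_string = False  # looks like a real closing quote
                out.append(ch)
            else:
                out.append('\\"')  # looks like a quote inside the value
        else:
            out.append(ch)
        i += 1
    return "".join(out)


json_str = '{\n"html": "<h3 >Waarom meer dan 200 Technical Experts - "Passie voor techniek"?</h3>"}'
data = json.loads(escape_inner_quotes(json_str))
```

The heuristic is ambiguous by nature: an inner quote that happens to be followed by a comma or colon, as in "He said "hi", then left", would still be mistaken for a closing quote.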

mangiucugna commented 6 months ago

Good point, let me try something. I am not sure how it would interact with all the other use cases, but it is worth a try.

mangiucugna commented 6 months ago

Actually, I noticed that the lib is already doing that :/ It is just limited to one use case because I wanted to be safe.

mangiucugna commented 6 months ago

Thanks for pointing me in the right direction, I am releasing 0.14.0 with this fix

nikolaysm commented 6 months ago

Top work, @mangiucugna! Thanks for the quick fix!