mangiucugna / json_repair

A python module to repair invalid JSON, commonly used to parse the output of LLMs
https://pypi.org/project/json-repair/
MIT License
1.16k stars, 65 forks

Issue with parsing when there is leading text #35

Closed lujasmine closed 6 months ago

lujasmine commented 6 months ago

Describe the bug

Parsing fails when the JSON is preceded by leading text.

To Reproduce

json_repair.loads("Based on the information extracted, here is the filled JSON output: ```json { 'a': 'b' } ```")
# returns the input string unchanged

Expected behavior

It should return { 'a': 'b' }

I've noticed that the repair works well with trailing text, e.g.

json_repair.loads("```json { 'a': 'b' } ``` This output reflects the information given in the input.")
# returns {'a': 'b'} as expected

mangiucugna commented 6 months ago

Hi @lujasmine, thanks for reporting this issue. It prompted a deeper look into how the library handles stray characters, and I am releasing 0.16.0 with better handling of those cases, which will fix yours.

However, I don't think relying on the library is the right way to handle this on your side. Most LLMs are trained to emit those tokens (the code fence markers) precisely so that you can isolate the JSON part of the message with simple string manipulation. I have also seen people simply remove everything before the first { and after the last } when they expect an object.

Up to you of course, but the more you clean on your side, the cleaner the final JSON.
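For illustration, the kind of pre-cleaning described above could look roughly like the sketch below. It is not part of json_repair: the helper name `extract_json_candidate` and the regular expression are assumptions made for this example. It first looks for a fenced block and, failing that, falls back to keeping only the text between the first { and the last }.

```python
import re

# Hypothetical helper, not part of json_repair: matches a ``` or ```json fence
# and captures whatever sits between the opening and closing markers.
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL)


def extract_json_candidate(text: str) -> str:
    """Return the part of `text` most likely to contain a JSON object."""
    # Prefer the content of the first fenced block, if there is one.
    match = FENCE_RE.search(text)
    if match:
        return match.group(1)

    # Otherwise keep only what lies between the first "{" and the last "}".
    start = text.find("{")
    end = text.rfind("}")
    if start != -1 and end > start:
        return text[start:end + 1]

    # Nothing recognizable: hand the text back unchanged.
    return text
```

The returned string can then be passed to json_repair.loads, so the library only has to repair the JSON itself rather than guess where it starts. The brace fallback only suits cases where a single object is expected; an array or several objects would need a different rule.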

lujasmine commented 6 months ago

That makes sense, I will make sure to do more cleaning on my side!

Thank you so much @mangiucugna!