mangiucugna / json_repair

A python module to repair invalid JSON, commonly used to parse the output of LLMs
https://pypi.org/project/json-repair/
MIT License

[Bug]: Unable to handle double quotes in start of string #57

Closed by RaahimSiddiqi 3 months ago

RaahimSiddiqi commented 3 months ago

Version of the library

0.25.2

Describe the bug

I am working with an LLM (Llama) and having it produce output in JSON format. I have encountered an edge case with Chinese headings: the model often emits two double quotes at the start of the "title" value, which makes the JSON invalid.

json_repair should fix this, but instead it returns an empty string for every title.

Output: [{"chapter_id": 1, "starting_time_stamp": "0:00:00", "title": ""}, {"chapter_id": 2, "starting_time_stamp": "0:01:00", "title": ""}, {"chapter_id": 3, "starting_time_stamp": "0:02:00", "title": ""}, {"chapter_id": 4, "starting_time_stamp": "0:04:00", "title": ""}, {"chapter_id": 5, "starting_time_stamp": "0:06:00", "title": ""}, {"chapter_id": 6, "starting_time_stamp": "0:09:00", "title": ""}, {"chapter_id": 7, "starting_time_stamp": "0:11:00", "title": ""}]

How to reproduce

Use the following JSON.

raw_json = """[
  {
    "chapter_id": 1,
    "starting_time_stamp": "0:00:00",
    "title": ""国内苹果用户和安卓用户使用TikTok的各种方法"
  },
  {
    "chapter_id": 2,
    "starting_time_stamp": "0:01:00",
    "title": ""苹果安卓通用最简单的方法"
  },
  {
    "chapter_id": 3,
    "starting_time_stamp": "0:02:00",
    "title": ""不插卡使用"
  },
  {
    "chapter_id": 4,
    "starting_time_stamp": "0:04:00",
    "title": ""免拔卡模式"
  },
  {
    "chapter_id": 5,
    "starting_time_stamp": "0:06:00",
    "title": ""MITM抓包安装支持MITM的旧版TikTok客户端"
  },
  {
    "chapter_id": 6,
    "starting_time_stamp": "0:09:00",
    "title": ""安卓用户使用修改版"
  },
  {
    "chapter_id": 7,
    "starting_time_stamp": "0:11:00",
    "title": ""苹果端无视SIM卡地区限制的第三方修改版"
  }
]"""

Calling code:

from json_repair import repair_json

# raw_json is the malformed string defined above
valid_json = repair_json(raw_json)
print(valid_json)

Expected behavior

Expected one of the two quotes at the start of each "title" string value to be removed, so that the title text is kept.
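
Until the library handles this case, a pre-processing step along these lines might serve as a workaround. This is only a sketch: `collapse_doubled_opening_quotes` is a hypothetical helper, not part of json_repair, and it assumes the doubled quote always directly follows a key's colon and is followed by real text. It could misfire on string values that themselves contain a colon followed by two quotes.

```python
import json
import re

# Hypothetical helper (not part of json_repair).
# Matches a colon, optional whitespace, then two quotes followed by a character
# that cannot legally follow an empty string value. Genuine empty strings
# such as `"title": ""}` are left untouched by the lookahead.
_DOUBLED_OPENING_QUOTE = re.compile(r'(:\s*)""(?=[^",}\]\s])')


def collapse_doubled_opening_quotes(text: str) -> str:
    """Turn `"key": ""value"` into `"key": "value"`."""
    return _DOUBLED_OPENING_QUOTE.sub(r'\1"', text)


# Shortened sample in the same shape as the report's input.
sample = (
    '[{"chapter_id": 3, "starting_time_stamp": "0:02:00", "title": ""不插卡使用"},'
    ' {"chapter_id": 4, "starting_time_stamp": "0:04:00", "title": ""}]'
)

fixed = collapse_doubled_opening_quotes(sample)
parsed = json.loads(fixed)  # the cleaned text is now valid JSON
print(parsed[0]["title"])
```

The cleaned text can then be handed to `json.loads` or `repair_json` as usual.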