mangiucugna / json_repair

A python module to repair invalid JSON, commonly used to parse the output of LLMs
https://pypi.org/project/json-repair/
MIT License

[Bug]: poor repair from json to just integer #50

Closed pseudotensor closed 4 months ago

pseudotensor commented 4 months ago

Version of the library

0.21.0

Describe the bug

See reproduction

How to reproduce

response0 = """  Here is an example employee profile in JSON format, with keys that are less than 64 characters and made of only alphanumerics, underscores, or hyphens:
```json
{
  "employee_id": 1234,
  "name": "John Doe",
  "email": "johndoe@example.com",
  "job_title": "Software Engineer",
  "department": "Engineering",
  "hire_date": "2020-01-01",
  "salary": 100000,
  "manager_id": 5678
}

In Markdown, you can display this JSON code block like this:

{ "employee_id": 1234, "name": "John Doe", "email": "johndoe@example.com", "job_title": "Software Engineer", "department": "Engineering", "hire_date": "2020-01-01", "salary": 100000, "manager_id": 5678 }

This will display the JSON code block with proper formatting and highlighting.
"""
from json_repair import repair_json
response = repair_json(response0)
print(response)

This prints just 64.

I don't see why it would pick that integer from the first sentence as the valid JSON.

How can I control which part it returns? For example, if I am always looking for a {} structure, I would like to be able to tell repair_json that.

Expected behavior

mangiucugna commented 4 months ago

ah! interesting.

What is happening here is that the parser skips everything until it finds something valid, in this case the integer 64. Once a top-level integer has been parsed, anything after it is invalid, hence this result. There is an easy workaround on your side, but I get the point of being able to just pass in the entire string instead.
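One possible form of that caller-side workaround, as a minimal sketch rather than anything json_repair itself provides: jump to the first `{` and let the standard library's `json.JSONDecoder.raw_decode` read exactly one object from there, ignoring the prose around it. The helper name `first_json_object` and the sample text are hypothetical. This only helps when the embedded object is already valid JSON; a malformed object would still need repairing after it has been cut out.

```python
import json


def first_json_object(text):
    """Return the first {...} object that decodes cleanly, or None.

    Hypothetical helper for illustration; not part of json_repair.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and
            # tolerates trailing text after it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # This brace did not open a valid object; try the next one.
            start = text.find("{", start + 1)
    return None


sample = (
    "Keys are less than 64 characters:\n"
    '{"employee_id": 1234, "name": "John Doe"}\n'
    "This will display the JSON code block."
)
print(first_json_object(sample))
```

Because the search starts at a brace, the stray 64 in the prose is never considered as a candidate value.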

Let me look at this

pseudotensor commented 4 months ago

I can do this:

from json_repair.json_repair import JSONParser
a = JSONParser(response0, None, None)
response = a.parse_object()

~And it seems to work, gives an object~ No, the "hyphens:" text still ruins that. One gets:

{"hyphens": "json\n{\n  \"employee_id", "name": "John Doe", "email": "johndoe@example.com", "job_title": "Software Engineer", "department": "Engineering", "hire_date": "2020-01-01", "salary": 100000, "manager_id": 5678}

mangiucugna commented 4 months ago

Yes, because the main parser method finds an integer and, according to the JSON BNF, nothing can follow a top-level integer. I am addressing this in a more general way.
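To make the failure mode concrete, here is a toy scanner that mimics the behavior described: it tries to decode a JSON value at every offset and stops at the first success. This is an illustration built on the standard library's `raw_decode`, not the actual json_repair parser, and `first_parsable_value` is a hypothetical name.

```python
import json


def first_parsable_value(text):
    """Toy scanner: return (value, ignored_tail) for the first value
    that decodes at any offset. Not json_repair's real algorithm."""
    decoder = json.JSONDecoder()
    for start in range(len(text)):
        try:
            value, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # nothing valid starts here, keep skipping
        # The first success wins; everything after it is left unparsed.
        return value, text[end:]
    return None, ""


value, ignored_tail = first_parsable_value(
    'keys shorter than 64 characters: {"a": 1}'
)
print(value)         # the bare integer from the prose
print(ignored_tail)  # the real object ends up in the discarded tail
```

The bare number in the prose is a complete JSON document on its own, so a "first valid value wins" strategy never reaches the object that follows it.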

pseudotensor commented 4 months ago

A good way would be to tell json_repair the expected type of structure, since the caller may be expecting a particular schema and already know what it is.

mangiucugna commented 4 months ago

I think there are two ways: either json_repair ignores the basic types (because otherwise it would parse one of them first), or I add some kind of schema. Until now I have avoided adding schema validation because, honestly, it is a pain in the ass to support that feature. I think it's easier to avoid parsing the basic types if they are not inside an object or array. This is what I am trying now, to see what breaks.
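Both ideas from this exchange can be sketched in a few lines, under the assumption that the embedded JSON is already valid: only `{` and `[` are allowed to start a top-level parse, so bare numbers and strings are skipped, and an optional `expected` argument stands in for the "expected type of structure" hint. The function `first_container` and its parameter are hypothetical and say nothing about how json_repair ended up implementing the fix.

```python
import json


def first_container(text, expected=(dict, list)):
    """Return the first object/array in `text` matching `expected`.

    Illustrative only; scalars outside a container are never parsed.
    """
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char not in "{[":
            continue  # basic types cannot start a top-level parse
        try:
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        if isinstance(value, expected):
            return value
    return None


sample = 'fewer than 64 characters: [1, 2] and {"employee_id": 1234}'
print(first_container(sample))                  # first container of any kind
print(first_container(sample, expected=dict))   # caller only wants an object
```

The type hint is far lighter than full schema validation: it only filters on the outermost Python type and leaves the contents unchecked.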

pseudotensor commented 4 months ago

Cool thanks!