mangiucugna / json_repair

A python module to repair invalid JSON, commonly used to parse the output of LLMs
https://pypi.org/project/json-repair/
MIT License

[Bug]: Failing to parse truncated JSON (due to LLM repetition and max_tokens) #59

Status: Closed (thigger closed this issue 1 month ago)

thigger commented 1 month ago

Version of the library

0.25.3

Describe the bug

Not sure whether to call this a bug or a feature request. Some models (Llama-3.1 in this case) have a habit of getting into loops, so the output gets truncated by max_tokens and the JSON is borked. I'd say there are two issues here: first, it's not parsing the quotes correctly (compare the single vs. double quotes in its output with the input); second, it's not managing to include much of the input JSON. Is it possible to parse this?

How to reproduce

LLM output: { "text description" : "subcutaneous oxycodone",\n"terms" : [\n {"term": "Localized swelling, mass and lump of skin and subcutaneous tissue", "score": 0},\n {"term": "Benign lipomatous neoplasm of skin and subcutaneous tissue of head, face and neck", "score": 0},\n {"term": "Localized hyperhidrosis", "score": 0},\n {"term": "Excessive and redundant skin and subcutaneous tissue", "score": 0},\n {"term": "Benign lipomatous neoplasm of skin and subcutaneous tissue of other and unspecified sites", "score": 0},\n {"term": "Superficial frostbite of neck", "score": 0},\n {"term": "Superficial frostbite", "score": 0},\n {"term": "Cellulitis and abscess of mouth", "score": 0},\n {"term": "Frostbite with tissue necrosis of neck", "score": 0},\n {"term": "Other disorders of skin and subcutaneous tissue, not elsewhere classified", "score": 0},\n {"term": "Localized swelling, mass and lump of skin and subcutaneous tissue", "score": 0},\n {"term": "Benign lipomatous neoplasm of skin and subcutaneous tissue of head, face and neck", "score": 0},\n {"term": "Localized hyperhidrosis", "score": 0},\n {"term": "Excessive and redundant skin and subcutaneous tissue", "score": 0},\n {"term": "Benign lipomatous neoplasm of skin and subcutaneous tissue of other and unspecified sites", "score": 0},\n {"term": "Superficial frostbite of neck", "score": 0},\n {"term": "Superficial frostbite", "score": 0},\n {"term": "Cellulitis and abscess of mouth", "score": 0},\n {"term": "Frostbite with tissue necrosis of neck", "score": 0},\n {"term": "Other disorders of skin and subcutaneous tissue, not elsewhere classified", "score": 0},\n {"term": "Localized swelling, mass and lump of skin and subcutaneous tissue", "score": 0},\n {"term": "Benign lipomatous neoplasm of skin and subcutaneous tissue of head, face and neck", "score": 0},\n {"term": "Localized hyperhidrosis", "score": 0},\n {"term": "Excessive and redundant skin and subcutaneous tissue", "score": 0},\n {"term": "Benign 
lipomatous neoplasm of skin and subcutaneous tissue of other and unspecified sites", "score": 0},\n {"term": "Superficial frostbite of neck", "score": 0},\n {"term": "Superficial frostbite", "score": 0},\n {"term": "Cellulitis and abscess of mouth", "score": 0},\n {"term": "Frostbite with tissue necrosis of neck", "score": 0},\n {"term": "Other disorders of skin and subcutaneous tissue, not elsewhere classified", "score": 0},\n {"term": "Localized swelling, mass and lump of skin and subcutaneous tissue", "score": 0},\n {"term": "Benign lipomatous neoplasm of skin and subcutaneous tissue of head, face and neck", "score": 0},\n {"term": "Localized hyperhidrosis", "score": 0},\n {"term": "Excessive and redundant skin and subcutaneous tissue", "score": 0},\n {"term": "Benign lipomatous neoplasm of skin and subcutaneous tissue of other and unspecified sites", "score": 0},\n {"term": "Superficial frostbite of neck", "score": 0},\n {"term": "Superficial frostbite", "score": 0},\n {"term": "Cellulitis and abscess of mouth", "score": 0},\n {"term": "Frostbite with tissue necrosis of neck", "score": 0},\n {"term": "Other disorders of skin and subcutaneous tissue, not elsewhere classified", "score": 0},\n {"term": "Localized swelling, mass and lump of skin and subcutaneous tissue", "score": 0},\n {"term": "Benign lipomatous neoplasm of skin and subcutaneous tissue of head, face and neck", "score": 0},\n {"

json_repair.loads(): {'text description" : "subcutaneous oxycodone': 'terms" : [\n {"term', 'Localized swelling, mass and lump of skin and subcutaneous tissue': 'score'}

Expected behavior

Ideally, the result would contain all of the LLM output in the loaded JSON, or at least have the "text description" and "terms" keys existing separately rather than being merged into a single key/value pair.
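To make the expected behavior concrete, here is a minimal stdlib-only sketch of one way truncated output like this could be salvaged: cut back to the last fully closed object or array, then close whatever containers are still open. This is not how json_repair works internally; `close_truncated_json` is a hypothetical helper, and it assumes the input is otherwise valid JSON that was simply cut off after at least one nested container was completed.

```python
import json

# Map each opening bracket to its closing counterpart.
CLOSERS = {"{": "}", "[": "]"}


def close_truncated_json(text: str) -> str:
    """Trim to the last closed container, then close the ones still open.

    Hypothetical helper for illustration only; it does not handle
    single quotes, comments, or other non-truncation damage.
    """
    stack = []        # currently open '{' / '[' characters
    safe_stack = []   # snapshot of `stack` at the last safe cut point
    last_safe = 0     # index just past the last closing bracket
    in_string = False
    escaped = False

    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in CLOSERS:
            stack.append(ch)
        elif ch in "}]" and stack:
            stack.pop()
            last_safe = i + 1
            safe_stack = list(stack)

    # Drop the incomplete tail and append the missing closers, innermost first.
    tail = "".join(CLOSERS[opener] for opener in reversed(safe_stack))
    return text[:last_safe] + tail


truncated = (
    '{ "text description" : "subcutaneous oxycodone",\n'
    '"terms" : [\n'
    ' {"term": "Localized hyperhidrosis", "score": 0},\n'
    ' {"term": "Superficial frostbite", "score": 0},\n'
    ' {"'
)
data = json.loads(close_truncated_json(truncated))
print(len(data["terms"]))
```

The trade-off in this sketch is that the partially written final entry is discarded rather than guessed at, so every term that was fully emitted before the cut-off survives and nothing is invented.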

I appreciate it might be a big change to json_repair, but I did wonder whether there might be a way to pass it a JSON schema so that it can ensure the output conforms.
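Until something like that exists, the shape of the repaired result can be checked after the fact. The sketch below is a hand-rolled structural check using only the standard library, not a real JSON Schema validator and not a json_repair feature; `conforms` is a hypothetical name, and the expected shape (a string "text description" plus a list of term/score objects) is inferred from the LLM output above.

```python
from typing import Any


def conforms(data: Any) -> bool:
    """Return True if `data` has the shape the prompt asked the LLM for.

    Assumed shape (taken from the example output in this issue):
    {"text description": str, "terms": [{"term": str, "score": number}, ...]}
    """
    if not isinstance(data, dict):
        return False
    if not isinstance(data.get("text description"), str):
        return False
    terms = data.get("terms")
    if not isinstance(terms, list):
        return False
    for item in terms:
        if not isinstance(item, dict):
            return False
        if not isinstance(item.get("term"), str):
            return False
        score = item.get("score")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            return False
    return True


good = {
    "text description": "subcutaneous oxycodone",
    "terms": [{"term": "Localized hyperhidrosis", "score": 0}],
}
print(conforms(good))
```

A check like this would reject the merged-key result reported above, because neither "text description" nor "terms" exists as a key in it, which gives the caller a signal to retry the LLM call or fall back.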

thigger commented 1 month ago

Apologies; I had an older version in the venv! My mistake. 0.25.3 handles this correctly.
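For anyone who hits the same thing, a quick way to confirm which version the active environment actually imports is to ask `importlib.metadata` (standard library, Python 3.8+). This is a small sketch; `installed_version` is a hypothetical helper name, and "json-repair" is the distribution name from the PyPI link at the top of this page.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Run this with the same interpreter (venv) that runs the failing code.
print(installed_version("json-repair"))
```

If this prints an older version than the one named in the bug report, upgrading inside that venv is the first thing to try before digging further.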