mangiucugna / json_repair

A python module to repair invalid JSON, commonly used to parse the output of LLMs
https://pypi.org/project/json-repair/
MIT License

[Bug]: issue with parsing rust code in JSON #72

Closed pchalasani closed 3 weeks ago

pchalasani commented 3 weeks ago

Version of the library

0.29.4

Describe the bug

In the example below, the value of the content field is a piece of Rust code that starts and ends with double quotes ("). The repaired JSON should simply escape the newlines within the string, and the content field value should be the entire code, but I get unexpected results, as you can see below.

In [21]: s = '''
    ...: {
    ...:     "request": "write_file_tool",
    ...:     "file_path": "src/lib.rs",
    ...:     "content": "use std::cmp::Ordering;
    ...:
    ...: pub struct GuessGame {
    ...:     secret_number: u32,
    ...: }
    ...:
    ...: impl GuessGame {
    ...:     pub fn new() -> Self {
    ...:         GuessGame { secret_number: 42 }
    ...:     }
    ...:
    ...:     pub fn check_guess(&self, guess: u32) -> String {
    ...:         match guess.cmp(&self.secret_number) {
    ...:             Ordering::Less => String::from(\"Too low!\"),
    ...:             Ordering::Greater => String::from(\"Too high!\"),
    ...:             Ordering::Equal => String::from(\"You got it!\"),
    ...:         }
    ...:     }
    ...: }
    ...: "
    ...: }
    ...: '''

In [22]: repair_json(s)
Out[22]: '[{"request": "write_file_tool", "file_path": "src/lib.rs", "content": "use std::cmp::Ordering;\\n\\npub struct GuessGame {\\n    secret_number: u32"}, {"secret_number": 42}, {"Ordering": "Less => String::from(", "low!": "Ordering::Greater => String::from(", "high!": "Ordering::Equal => String::from(", "it!": ""}]'

How to reproduce

see above.

Expected behavior

The repaired JSON should simply escape the newlines within the string, and the value of the content field should be the entire code.

EDITED - sorry for the multiple edits - I now have the actual failure case shown above.
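
For reference, the expected escaping can be illustrated with the standard library alone. This is a minimal sketch that does not use json_repair; the shortened Rust snippet and the variable names are assumptions made up for illustration. It shows how a strictly valid encoder represents newlines and inner quotes in such a content value:

```python
import json

# Hypothetical, shortened stand-in for the Rust source in the report.
rust_source = 'pub fn hint() -> String {\n    String::from("Too low!")\n}\n'

payload = {
    "request": "write_file_tool",
    "file_path": "src/lib.rs",
    "content": rust_source,
}

# json.dumps writes each newline as the two characters \n and each inner
# double quote as \", so the whole document fits on a single line.
encoded = json.dumps(payload)
print(encoded)

# Decoding restores the original source text unchanged.
decoded = json.loads(encoded)
print(decoded["content"] == rust_source)
```

A repaired document of this shape round-trips the entire code in the content field, which is the behavior described above.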

pchalasani commented 3 weeks ago

btw I really like this library for handling JSONs from weak LLMs, but ran into the above issue.

pchalasani commented 3 weeks ago

Here's a link to the playground where I tried it.

mangiucugna commented 3 weeks ago

Hi! I found the issue in the parser, so I will release a new version shortly. One note for your use case, though: you need to use raw strings to pass the escaping correctly:

from pprint import pprint

import json_repair

json_to_fix = r"""
{
    "request": "write_file_tool",
    "file_path": "src/lib.rs",
    "content": "use std::cmp::Ordering;

pub struct GuessGame {
    secret_number: u32
}

impl GuessGame {
    pub fn new() -> Self {
        GuessGame { secret_number: 42 }
    }

    pub fn check_guess(&self, guess: u32) -> String {
        match guess.cmp(&self.secret_number) {
            Ordering::Less => String::from(\"Too low!\"),
            Ordering::Greater => String::from(\"Too high!\"),
            Ordering::Equal => String::from(\"You got it!\"),
        }
    }
}
"
}
"""

pprint(json_repair.loads(json_to_fix))

Output:

{'content': 'use std::cmp::Ordering;\n'
            '\n'
            'pub struct GuessGame {\n'
            '    secret_number: u32\n'
            '}\n'
            '\n'
            'impl GuessGame {\n'
            '    pub fn new() -> Self {\n'
            '        GuessGame { secret_number: 42 }\n'
            '    }\n'
            '\n'
            '    pub fn check_guess(&self, guess: u32) -> String {\n'
            '        match guess.cmp(&self.secret_number) {\n'
            '            Ordering::Less => String::from("Too low!"),\n'
            '            Ordering::Greater => String::from("Too high!"),\n'
            '            Ordering::Equal => String::from("You got it!"),\n'
            '        }\n'
            '    }\n'
            '}',
 'file_path': 'src/lib.rs',
 'request': 'write_file_tool'}

The presence of \n is a quirk of how escaping works, but it shouldn't be a problem for you to simply remove those or process them appropriately. Parsing Rust is particularly difficult because its syntax resembles JSON quite a bit.
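
The raw-string note can be checked with plain Python, independently of json_repair. The sketch below uses a single shortened line from the example (an assumption for brevity) and the standard json module to show what a parser receives in each case:

```python
import json

# Non-raw literal: Python consumes the backslash, leaving a bare double quote.
plain = '''String::from(\"Too low!\")'''
# Raw literal: the backslash survives, so the quote stays escaped.
raw = r'''String::from(\"Too low!\")'''

print(plain)
print(raw)

# Wrapped as a JSON string value, only the raw version is valid JSON.
parsed = json.loads('{"content": "' + raw + '"}')
print(parsed["content"] == plain)

try:
    json.loads('{"content": "' + plain + '"}')
    plain_is_valid = True
except json.JSONDecodeError:
    # The unescaped inner quotes end the string early.
    plain_is_valid = False
print(plain_is_valid)
```

With the non-raw literal, the inner quotes reach the parser unescaped, which leaves it to guess where the string ends.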