mangiucugna / json_repair

A python module to repair invalid JSON, commonly used to parse the output of LLMs
https://pypi.org/project/json-repair/
MIT License

[Bug]: json_repair incredibly slow with from_file #66

Closed · benjamingr closed this 1 month ago

benjamingr commented 1 month ago

Version of the library

0.25.3

Describe the bug

Parsing a file with from_file is significantly slower than parsing it with .loads due to the StringFileWrapper abstraction. Namely, seeking one byte at a time is incredibly slow.

cProfile and strace show the time is spent in:

 11079787    1.396    0.000   27.749    0.000 json_repair.py:619(get_char_at)
 11080226    3.403    0.000   26.355    0.000 json_repair.py:36(__getitem__)

This makes sense, since the library is reading one byte at a time.
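
For reference, the access pattern behind those numbers is roughly the following (an illustrative sketch of the pattern, not the library's actual code):

class NaiveFileWrapper:
    # Stand-in for the per-character access StringFileWrapper performs.
    def __init__(self, fd):
        self.fd = fd

    def __getitem__(self, index):
        # One seek() plus one read(1) per character: parsing an n-byte
        # file costs on the order of n syscall pairs, which matches
        # what cProfile and strace show above.
        self.fd.seek(index)
        return self.fd.read(1)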

How to reproduce

With the following JSON file:

{
    "type": "TPM2_ALG_ECC",
    "nameAlg":"TPM2_ALG_SHA256",
    "srk_template": "system,restricted,decrypt,0x81000001",
    "srk_description": "Storage root key SRK",
    "srk_persistent": 0,
    "ek_template":  "system,restricted,decrypt",
    "ek_description": "Endorsement key EK",
    "ecc_signing_scheme": {
        "scheme":"TPM2_ALG_ECDSA",
        "details":{
            "hashAlg":"TPM2_ALG_SHA256"
        },
    },
    "sym_mode":"TPM2_ALG_CFB",
    "sym_parameters": {
        "algorithm":"TPM2_ALG_AES",
        "keyBits":"128",
        "mode":"TPM2_ALG_CFB"
    },
    "sym_block_size": 16,
    "pcr_selection": [
       { "hash": "TPM2_ALG_SHA1",
         "pcrSelect": [ ],
       },
       { "hash": "TPM2_ALG_SHA256",
         "pcrSelect": [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 ]
       }
    ],
    "curveID": "TPM2_ECC_NIST_P256",
    "ek_policy": {
        "description": "Endorsement hierarchy used for policy secret.",
        "policy":[
            {
                "type":"POLICYSECRET",
                "objectName": "4000000b",
            }
        ]
    }
}

Calling:

json_repair.from_file('/tmp/json2.json') hangs for a long time, but json_repair.loads(open('/tmp/json2.json').read()) finishes instantly.
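
A quick way to see the gap (plain timing code; it assumes only the two calls above and the file at /tmp/json2.json):

import time

import json_repair

t0 = time.perf_counter()
json_repair.from_file('/tmp/json2.json')
print('from_file:', time.perf_counter() - t0, 'seconds')

t0 = time.perf_counter()
json_repair.loads(open('/tmp/json2.json').read())
print('loads:', time.perf_counter() - t0, 'seconds')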

Expected behavior

from_file should be as fast as .loads(open(file).read())

mangiucugna commented 1 month ago

Hi! Thanks a lot for the report; finally someone is using this feature. I will need to find some time to sit down and think about it. The basic idea is to avoid loading the entire file into memory, because you don't know how big it is; maybe I am overthinking it. Nonetheless, the library should read the file in large chunks rather than one byte at a time.

Feel free to think along with me and/or push a PR.

benjamingr commented 1 month ago

Thanks for the quick response!

I think it's fine to buffer the first n bytes of the file (e.g. 1 MB) and fall back to loading it in 1 MB chunks at a time if it's larger than that arbitrary threshold.
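
Roughly this shape, as a sketch (the class and names are hypothetical, just to illustrate the idea, not the actual patch):

CHUNK_SIZE = 1024 * 1024  # 1 MB, an arbitrary threshold

class ChunkedFileWrapper:
    # Hypothetical sketch: serve __getitem__ from lazily loaded 1 MB
    # chunks instead of one seek() + read(1) per character.
    def __init__(self, fd):
        self.fd = fd
        self.chunks = {}  # chunk index -> chunk contents

    def __getitem__(self, index):
        chunk_id, offset = divmod(index, CHUNK_SIZE)
        if chunk_id not in self.chunks:
            self.fd.seek(chunk_id * CHUNK_SIZE)
            self.chunks[chunk_id] = self.fd.read(CHUNK_SIZE)
        return self.chunks[chunk_id][offset]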

benjamingr commented 1 month ago

Here is a draft patch to indicate what I had in mind: https://github.com/mangiucugna/json_repair/pull/67. If you think it's a good direction, I'm happy to test it better and run all the tooling.

I'm not sure this is the right approach; it's probably still much faster to just check whether the file is small and, if it's under the size where streaming/memory becomes a concern, read it in one go.

mangiucugna commented 1 month ago

Thanks for the PR; the direction seems correct, though I haven't had the time to review it in detail. One thing I didn't see, and that I think is necessary, is freeing the buffer to save memory (say you keep at most 2 chunks).

Otherwise this is exactly the same as loading the entire file into memory. The good news is that the library looks back very little, so keeping just 2 chunks should be more than enough.
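
Something along these lines, as a hypothetical sketch on top of the chunk idea above (illustrative only, not a review of the actual PR):

MAX_CHUNKS = 2  # the parser barely looks back, so two chunks suffice

class TwoChunkFileWrapper:
    # Hypothetical sketch: same lazy chunk loading as above, but evict
    # the oldest chunk so memory stays capped at ~MAX_CHUNKS * chunk_size.
    def __init__(self, fd, chunk_size=1024 * 1024):
        self.fd = fd
        self.chunk_size = chunk_size
        self.chunks = {}  # chunk index -> chunk contents

    def __getitem__(self, index):
        chunk_id, offset = divmod(index, self.chunk_size)
        if chunk_id not in self.chunks:
            while len(self.chunks) >= MAX_CHUNKS:
                # Parsing moves forward, so the lowest-numbered chunk
                # is the one least likely to be needed again.
                del self.chunks[min(self.chunks)]
            self.fd.seek(chunk_id * self.chunk_size)
            self.chunks[chunk_id] = self.fd.read(self.chunk_size)
        return self.chunks[chunk_id][offset]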