python-jsonschema / referencing

Cross-specification JSON referencing (JSON Schema, OpenAPI, and the one you just made up!)
https://referencing.readthedocs.io/
MIT License

Performance of referencing library vs deprecated jsonschema.RefResolver is very bad when there are a lot of references in schema #178

Open nathan-stender opened 2 months ago

nathan-stender commented 2 months ago

Hello!

I have a library (allotropy) that converts scientific data into JSON conforming to a schema called the Allotrope Simple Model (ASM).

The validation schemas are fairly large and complicated compared to other schemas I've seen in discussion boards, and are very modular, meaning there are a lot of references. In allotropy we store the ASM schemas directly, and remove all remote references, replacing them with local references under $defs.
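To make the "local references under $defs" part concrete, here is a rough sketch of the kind of rewrite I mean. It is not the actual allotropy code: the helper name `localize_refs`, the `remote_N` key naming, and the example URI are all made up for illustration, and it only handles `$ref` values that match a known URI exactly (no fragments).

```python
def localize_refs(schema, remote):
    """Return a copy of `schema` in which every $ref that exactly matches a
    URI in `remote` points at a local entry under $defs instead.

    `remote` maps a URI string to the already-loaded schema for that URI.
    Hypothetical helper: refs with fragments (uri#/foo) are left untouched.
    """
    # Assign a stable local name to each remote URI (naming is an assumption).
    names = {uri: f"remote_{i}" for i, uri in enumerate(sorted(remote))}

    def rewrite(node):
        if isinstance(node, dict):
            out = {}
            for key, value in node.items():
                if key == "$ref" and isinstance(value, str) and value in names:
                    out[key] = f"#/$defs/{names[value]}"
                else:
                    out[key] = rewrite(value)
            return out
        if isinstance(node, list):
            return [rewrite(item) for item in node]
        return node

    result = rewrite(schema)
    defs = result.setdefault("$defs", {})
    for uri, subschema in remote.items():
        # Remote schemas may reference each other, so rewrite them as well.
        defs[names[uri]] = rewrite(subschema)
    return result
```

With a schema containing `{"$ref": "https://example.com/unit.json"}`, the result contains `{"$ref": "#/$defs/remote_0"}` plus the referenced schema stored under `$defs`, so validation never needs to fetch anything.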

We are finding that validating against the schemas using jsonschema version 4.18.0 takes ~20x longer than 4.17.0.

As a concrete example:

Validating this data: https://raw.githubusercontent.com/Benchling-Open-Source/allotropy/refs/heads/main/tests/parsers/moldev_softmax_pro/testdata/MD_SMP_luminescence_endpoint_example08.json

Against this schema: https://github.com/Benchling-Open-Source/allotropy/blob/main/src/allotropy/allotrope/schemas/adm/plate-reader/REC/2024/06/plate-reader.schema.json

takes ~3.5s on 4.17.0 and ~55s on 4.18.0

Across all 26 tests in tests/parsers/moldev_softmax_pro, total runtime goes from ~30s on 4.17.0 to ~6m on 4.18.0.
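For anyone who wants to reproduce the timing, something along these lines should work. It is a minimal sketch, not the exact script behind the numbers above: the function names and file paths are placeholders, and it assumes the schema and data files have been downloaded locally. It uses only `validator_for`, `check_schema`, and `iter_errors` from jsonschema, so it can be run unchanged under both 4.17.0 and 4.18.0+.

```python
import json
import time


def best_time(fn, repeat=3):
    """Call `fn` `repeat` times and return the fastest wall-clock run in seconds."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def benchmark(schema_path, instance_path, repeat=3):
    """Time validation of one instance against one schema (paths are placeholders)."""
    # Imported lazily so the timing helper above works without jsonschema installed.
    from jsonschema.validators import validator_for

    with open(schema_path) as f:
        schema = json.load(f)
    with open(instance_path) as f:
        instance = json.load(f)

    validator_cls = validator_for(schema)  # picks the draft from "$schema"
    validator_cls.check_schema(schema)     # fail early on an invalid schema
    validator = validator_cls(schema)

    # Exhaust iter_errors so the whole instance is checked, valid or not.
    return best_time(lambda: list(validator.iter_errors(instance)), repeat)
```

Calling `benchmark("plate-reader.schema.json", "MD_SMP_luminescence_endpoint_example08.json")` once per installed jsonschema version gives directly comparable numbers.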

Julian commented 2 months ago

Hey there, I'm happy to have a look at this at some point, but is there a reason you're benchmarking against such an old version? Lots has changed since 4.18, so it'd be good if you shared numbers which were on 4.23.

nathan-stender commented 2 months ago

Sorry, I didn't mention that I tested on every version between 4.18 and 4.23 to see if any had better performance. None of the versions past 4.18 improve the performance noticeably.

On 4.23, the results are actually a bit worse:

For the single test: 55s

For the 26 tests: 6m33s

sameeul commented 2 weeks ago

We have also experienced a similar performance issue in one of our tools after switching from RefResolver to this library. This is the change in our library: https://github.com/PolusAI/workflow-inference-compiler/pull/287