python-jsonschema / jsonschema

An implementation of the JSON Schema specification for Python
https://python-jsonschema.readthedocs.io
MIT License
4.61k stars 581 forks source link

Faster pprint #1145

Closed atsuoishimoto closed 1 year ago

atsuoishimoto commented 1 year ago

jsonschema can sometimes be very slow when validation errors occur. This is due to the slow formatting of large schemas or data with pprint.pformat(). In this PR, I propose replacing pprint.pformat() with json.dumps(), which is expected to perform more than 10 times faster.

The following samples measure the time taken to format 300 GitHub PRs.

pprint.pformat()

$ python3 -m timeit -s "import requests, pprint;L=requests.get('https://api.github.com/repos/python/cpython/pulls?state=all').json() * 100" "pprint.pformat(L)"
1 loop, best of 5: 14.2 sec per loop

json.dumps()

$ python3 -m timeit -s "import requests, json;L=requests.get('https://api.github.com/repos/python/cpython/pulls?state=all').json() * 100" "json.dumps(L, ensure_ascii=False, indent=2, default=repr)"
1 loop, best of 5: 1.19 sec per loop

(Memo: If it's possible to add as a dependency, using orjson would make it more than 10 times faster.)

python3 -m timeit -s "import requests, orjson;L=requests.get('https://api.github.com/repos/python/cpython/pulls?state=all').json() * 100" "orjson.dumps(L, option=orjson.OPT_INDENT_2|orjson.OPT_SORT_KEYS, default=repr)"
5 loops, best of 5: 79.2 msec per loop

json.dumps() does not perform wordwrapping or other formatting pformat() does. However, such formatting can insert unnecessary newline characters into the data, sometimes making debugging hard. I believe it's more appropriate not to perform such modifications.


:books: Documentation preview :books:: https://python-jsonschema--1145.org.readthedocs.build/en/1145/

Julian commented 1 year ago

Hi there, thanks for the PR.

jsonschema can sometimes be very slow when validation errors occur. This is due to the slow formatting of large schemas or data with pprint.pformat().

This sounds slightly odd -- exceptions are only formatted when they're printed (or otherwise when repr ends up getting called) -- your application repeatedly prints large numbers of exceptions coming from jsonschema?

Regardless though, this PR likely can't go forward, because it isn't the case that exception contents will be JSON serializable! The instance or schema may have been deserialized using a custom JSONDecoder which the library won't know about, or may not have originated in JSON ini the first place!

atsuoishimoto commented 1 year ago

Hi, thank you for comment!

your application repeatedly prints large numbers of exceptions coming from jsonschema?

Yes, we output errors as log entries. Generally, they don't cause harm, but they can be time-consuming when outputting a JSON with numerous elements involved.

Regardless though, this PR likely can't go forward, because it isn't the case that exception contents will be JSON serializable!

Agreed. So the default=repr option will handle such non-serializable objects. This is also the default behavior when pprint outputs user-defined objects.

Julian commented 1 year ago

I still wouldn't want to use json.dumps, as it introduces the possibility that a schema is invalid for some instance before dumping but appears to be valid afterwards. This is the case if e.g. as mentioned the user used a custom deserializer, but also in cases like coercing numeric keys (properties) to strings.

Is there a reason in your logs you don't simply format exceptions in whatever way you'd like by pulling off whatever attributes you're interested in?

You might also be interested in #243 which is about making this precise function (pformat) injectable, and would allow you to inject json.dumps if you so preferred, without touching the internals. That'd be more like a solution I'd be comfortable with.

atsuoishimoto commented 1 year ago

I still wouldn't want to use json.dumps, as it introduces the possibility that a schema is invalid for some instance before dumping but appears to be valid afterwards. This is the case if e.g. as mentioned the user used a custom deserializer, but also in cases like coercing numeric keys (properties) to strings.

Points taken.

Is there a reason in your logs you don't simply format exceptions in whatever way you'd like by pulling off whatever attributes you're interested in?

No, we can format our log ourselves. But I thought using json.dumps() is a reasonable default to improve jsonschema.

I'm closing the PR. Thank you!

Julian commented 1 year ago

Fair enough, I appreciate the offer(s) regardless! Thanks for giving it a shot.