Closed atsuoishimoto closed 1 year ago
Hi there, thanks for the PR.
jsonschema can sometimes be very slow when validation errors occur. This is due to the slow formatting of large schemas or data with pprint.pformat().
This sounds slightly odd -- exceptions are only formatted when they're printed (or otherwise when repr
ends up getting called) -- your application repeatedly prints large numbers of exceptions coming from jsonschema?
Regardless though, this PR likely can't go forward, because it isn't the case that exception contents will be JSON serializable! The instance or schema may have been deserialized using a custom JSONDecoder
which the library won't know about, or may not have originated in JSON ini the first place!
Hi, thank you for comment!
your application repeatedly prints large numbers of exceptions coming from jsonschema?
Yes, we output errors as log entries. Generally, they don't cause harm, but they can be time-consuming when outputting a JSON with numerous elements involved.
Regardless though, this PR likely can't go forward, because it isn't the case that exception contents will be JSON serializable!
Agreed. So the default=repr
option will handle such non-serializable objects. This is also the default behavior when pprint outputs user-defined objects.
I still wouldn't want to use json.dump
s, as it introduces the possibility that a schema is invalid for some instance before dumping but appears to be valid afterwards. This is the case if e.g. as mentioned the user used a custom deserializer, but also in cases like coercing numeric keys (properties) to strings.
Is there a reason in your logs you don't simply format exceptions in whatever way you'd like by pulling off whatever attributes you're interested in?
You might also be interested in #243 which is about making this precise function (pformat
) injectable, and would allow you to inject json.dumps
if you so preferred, without touching the internals. That'd be more like a solution I'd be comfortable with.
I still wouldn't want to use json.dumps, as it introduces the possibility that a schema is invalid for some instance before dumping but appears to be valid afterwards. This is the case if e.g. as mentioned the user used a custom deserializer, but also in cases like coercing numeric keys (properties) to strings.
Points taken.
Is there a reason in your logs you don't simply format exceptions in whatever way you'd like by pulling off whatever attributes you're interested in?
No, we can format our log ourselves. But I thought using json.dumps() is a reasonable default to improve jsonschema.
I'm closing the PR. Thank you!
Fair enough, I appreciate the offer(s) regardless! Thanks for giving it a shot.
jsonschema can sometimes be very slow when validation errors occur. This is due to the slow formatting of large schemas or data with pprint.pformat(). In this PR, I propose replacing pprint.pformat() with json.dumps(), which is expected to perform more than 10 times faster.
The following samples measure the time taken to format 300 GitHub PRs.
pprint.pformat()
json.dumps()
(Memo: If it's possible to add as a dependency, using orjson would make it more than 10 times faster.)
json.dumps() does not perform wordwrapping or other formatting pformat() does. However, such formatting can insert unnecessary newline characters into the data, sometimes making debugging hard. I believe it's more appropriate not to perform such modifications.
:books: Documentation preview :books:: https://python-jsonschema--1145.org.readthedocs.build/en/1145/