jupyter / nbformat

Reference implementation of the Jupyter Notebook format
http://nbformat.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
265 stars 152 forks source link

Add ability to capture validation errors #249

Closed kevin-bates closed 2 years ago

kevin-bates commented 2 years ago

This pull request stems from an issue discovered during Jupyter Server performance analysis in which it was learned that the server is validating notebooks twice (and on both read and write operations). The crux of the issue is that both methods do not give the caller information regarding the validation status of their content - so they must explicitly validate a second time. This appears to have been the case for the last 7 to 8 years (since the code resided in notebook). I suspect this is based on decisions at the time, but this pull request attempts to rectify this in a (hopefully) acceptable manner. Since the elimination of the redundant validation can improve performance by nearly 50%, this seems like an opportunity we should not ignore.

Because so many years have passed, I believe there are at least two backward-compatibility concerns that lead to this particular resolution.

  1. The read/write calls do not inform the caller that the operation encountered a validation error. As a result, the content (on read) is always returned, even in light of validation errors, which, today, are logged as errors. This behavior also leads to the content always being written, again, in light of validation errors. (Of course, filesystem-related issues will raise appropriate exceptions, but all things being equal, the caller has no idea a validation issue is present.)
  2. Applications wish to format their own validation error. Jupyter Server (and notebook) used a subsequent (and unconditional) call to nbformat's validate() method, to capture the exception and produce an application-friendly error message when validation issues are encountered.

This change adds an optional dictionary-valued parameter (capture_validation_error) that applications can pass to capture the ValidationError instance for use by the calling application. If the parameter is non-None and a dictionary instance, a validation error will be added into the dictionary under the key 'ValidationError' while the corresponding value will contain the ValidationError instance. This would allow applications that make an additional call to validate() to remove the second call since they have both their content (on reads) and the ValidationError instance (when validation issues are present) they can use to make decisions.

Alternative approaches

  1. One approach would be to add an optional boolean parameter that skips validation on the read/write methods, leaving actual validation up to the application. Since today's read/write code leaves no indication that a validation error was encountered, this approach would work as well. However, given that read/write methods already include validation, it does require an additional method call and feels a bit clunky as it violates the original intent.
  2. Another approach would be to raise ValidationError when it occurs, which also allows the application to produce a friendly message, but will prevent the return of content (on reads) and persistence of content (on writes) - that is assumed behavior today.

There may be other solutions, but I think we should do something as this is the kind of low-hanging fruit that performance-sensitive folks live for. :smile:

cc: @goanpeca, @mlucool

kevin-bates commented 2 years ago

Looks like the Windows 3.6 platform is no longer available for CI. I can remove 3.6 from the CI matrix, but that feels better suited for another PR.

kevin-bates commented 2 years ago

Thank you @blink1073! It will be great to take advantage of these changes in Jupyter Server.