I think there's also redundant serialization happening within `validate`, and that's now the majority of computation time in validation: more than half the time is now spent in serialization, not validation.
The top-level JAMS validator here does the following:

1. serialize the entire object (observations included) and validate it against the top-level JAMS schema;
2. call validate on each annotation.

Within the annotation validator here, the following happens:

3. serialize the annotation (observations again) and validate it against the generic Annotation schema;
4. serialize the observation data once more and validate it against the namespace schema.
What this all boils down to is that observations get serialized three times: once for the top-level pass (1), once for the generic Annotation pass (3) and once for the namespace pass (4).
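In pseudo-code, the redundant flow looks roughly like this (simplified stand-ins for the real classes; `serialize` is a hypothetical helper, not the actual jams API):

```python
# Simplified sketch of the current flow; not the actual jams implementation.
import json
import jsonschema

def serialize(obj):
    # Stand-in for converting a JAMS object / Annotation to plain JSON types.
    # In practice this walks every observation, which is the expensive part.
    return json.loads(json.dumps(obj))

def validate_jams(jam, jams_schema, ann_schema, namespace_schema):
    # (1) top-level pass: serializes everything, observations included
    jsonschema.validate(serialize(jam), jams_schema)

    # (2) hand off to each annotation's validator
    for ann in jam['annotations']:
        # (3) generic Annotation pass: observations serialized a second time
        jsonschema.validate(serialize(ann), ann_schema)

        # (4) namespace pass: observations serialized a third time
        jsonschema.validate(serialize(ann)['data'], namespace_schema)
```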
So: how can we cut down on redundant serialization?
1. Skip annotation serialization in `JAMS.validate`, since those will be done independently anyway. We'd still need to do a post-check to make sure that the `annotations` array is well formed (i.e., an `AnnotationArray` type), but that's relatively cheap. (A rough sketch follows after this list.)
2. Hack `Annotation.validate` to bypass observations in the first pass. Alternately, promote the vectorized namespace validator to a full annotation validator, and skip the `super` validate here. (I think I like this second option better.)
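For option (1), a rough sketch of what the top-level pass could become (hypothetical helper names, and plain dicts/lists standing in for the real `JAMS`/`AnnotationArray` types):

```python
# Rough sketch of option (1); validate_annotation is passed in as a stand-in
# for Annotation.validate, and plain dicts/lists stand in for JAMS objects.
import jsonschema

def validate_jams_shallow(jam, jams_schema, validate_annotation):
    # Serialize the top level with the heavy observation data stubbed out,
    # so observations are not walked here and again in each annotation pass.
    shallow = dict(jam, annotations=[])
    jsonschema.validate(shallow, jams_schema)

    # Cheap post-check that the annotations container is well formed
    # (the real check would be against the AnnotationArray type).
    if not isinstance(jam.get('annotations'), list):
        raise jsonschema.ValidationError('annotations must be an array')

    # Each annotation still validates itself independently.
    for ann in jam['annotations']:
        validate_annotation(ann)
```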
@ejhumphrey what do you think?
I think I need time to digest this later, but first a question – are these mutually exclusive options?
> are these mutually exclusive options?
Nope, mostly just interested in feedback on whether there might be alternative routes to eliminating redundant serialization (short of caching), and what the best way to go about it might be.
On first read, both seem good, and I think doing (2) first, then (1) if needed... makes sense to me?
Ok, doing a light-weight serialization of `Annotation` (skipping the `data` field) shaves ~1s off my running example (52s -> 7.1s -> 6.2s).
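Concretely, the light-weight pass amounts to something like this (a sketch with plain dicts, not the actual patch):

```python
# Sketch of the light-weight generic pass: validate the annotation with its
# 'data' field emptied, so the observations are only serialized by the
# namespace-specific pass. Not the actual patch.
import jsonschema

def validate_annotation_lightweight(ann, annotation_schema):
    lightweight = dict(ann, data=[])  # keep the structure, skip the observations
    jsonschema.validate(lightweight, annotation_schema)
```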
Gonna try the light-weight serialization on `JAMS` now.
Bypassing the annotation array validation within `JAMS.validate()` brings the running example down to 3.31s. This seems acceptably fast to me, given where it started out.
Rolled in a bugfix for #171 that I noticed while implementing tests on the accelerated validation code.
I think this one's ready for CR.
One more benchmark, using the biggest file in the MedleyDB dataset: `AlexanderRoss_VelvetCurtain.jams` (24MB).
```
In [4]: %time J = jams.load('AlexanderRoss_VelvetCurtain.jams', validate=False)
CPU times: user 1.01 s, sys: 48 ms, total: 1.06 s
Wall time: 1.05 s

In [5]: %time J = jams.load('AlexanderRoss_VelvetCurtain.jams', validate=True)
CPU times: user 1min 35s, sys: 72 ms, total: 1min 35s
Wall time: 1min 35s

In [6]: %time J = jams.load('AlexanderRoss_VelvetCurtain.jams', validate=True)
CPU times: user 6.63 s, sys: 24 ms, total: 6.66 s
Wall time: 6.66 s
```
This PR accelerates schema validation, in particular for dense array data, following the thread over in #132 / #46 .
So far, the main improvement is due to vectorizing schema validation at the annotation level, rather than validating each observation independently. This provides somewhere between a 4x and 8x speedup without changing any functionality.
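To illustrate the idea (with a made-up field name and toy schema, not the actual namespace schemas): instead of one `jsonschema.validate` call per observation, the fields are collected across all observations and validated in a single call against an array-typed schema.

```python
# Toy illustration of vectorized validation vs. per-observation validation.
# The schema and field name are illustrative, not the real jams namespaces.
import jsonschema

VALUE_SCHEMA = {'type': 'number', 'minimum': 0.0, 'maximum': 1.0}

def validate_per_observation(observations):
    # Old approach: one jsonschema call per observation -- slow for dense data.
    for obs in observations:
        jsonschema.validate(obs['value'], VALUE_SCHEMA)

def validate_vectorized(observations):
    # Vectorized approach: one call validates all values at once.
    array_schema = {'type': 'array', 'items': VALUE_SCHEMA}
    jsonschema.validate([obs['value'] for obs in observations], array_schema)

if __name__ == '__main__':
    observations = [{'value': v / 1000.0} for v in range(1000)]
    validate_per_observation(observations)   # many validator invocations
    validate_vectorized(observations)        # a single invocation
```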