I think there's also redundant serialization happening within `validate`, and that's now the majority of computation time in validation: more than half the time is now spent in serialization, not validation.
The top-level JAMS validator here does the following:

1. serialize the entire object (observations included) and validate it against the top-level JAMS schema;
2. call validate on each annotation.

Within the annotation validator here, the following happens:

3. serialize the annotation (observations again) and validate it against the generic Annotation schema;
4. serialize the observation data once more and validate it against the namespace schema.
What this all boils down to is that observations get serialized three times: once for the top-level pass (1), once for the generic Annotation pass (3) and once for the namespace pass (4).
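In pseudo-code, the redundant flow looks roughly like this (simplified stand-ins for the real classes; `serialize` is a hypothetical helper, not the actual jams API):

```python
# Simplified sketch of the current flow; not the actual jams implementation.
import json
import jsonschema

def serialize(obj):
    # Stand-in for converting a JAMS object / Annotation to plain JSON types.
    # In practice this walks every observation, which is the expensive part.
    return json.loads(json.dumps(obj))

def validate_jams(jam, jams_schema, ann_schema, namespace_schema):
    # (1) top-level pass: serializes everything, observations included
    jsonschema.validate(serialize(jam), jams_schema)

    # (2) hand off to each annotation's validator
    for ann in jam['annotations']:
        # (3) generic Annotation pass: observations serialized a second time
        jsonschema.validate(serialize(ann), ann_schema)

        # (4) namespace pass: observations serialized a third time
        jsonschema.validate(serialize(ann)['data'], namespace_schema)
```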
So: how can we cut down on redundant serialization?
1. Skip annotation serialization in `JAMS.validate`, since those will be done independently anyway. We'd still need to do a post-check to make sure that the `annotations` array is well formed (i.e., an `AnnotationArray` type), but that's relatively cheap. (A rough sketch follows after this list.)
2. Hack `Annotation.validate` to bypass observations in the first pass. Alternately, promote the vectorized namespace validator to a full annotation validator, and skip the `super` validate here. (I think I like this second option better.)
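For option (1), a rough sketch of what the top-level pass could become (hypothetical helper names, and plain dicts/lists standing in for the real `JAMS`/`AnnotationArray` types):

```python
# Rough sketch of option (1); validate_annotation is passed in as a stand-in
# for Annotation.validate, and plain dicts/lists stand in for JAMS objects.
import jsonschema

def validate_jams_shallow(jam, jams_schema, validate_annotation):
    # Serialize the top level with the heavy observation data stubbed out,
    # so observations are not walked here and again in each annotation pass.
    shallow = dict(jam, annotations=[])
    jsonschema.validate(shallow, jams_schema)

    # Cheap post-check that the annotations container is well formed
    # (the real check would be against the AnnotationArray type).
    if not isinstance(jam.get('annotations'), list):
        raise jsonschema.ValidationError('annotations must be an array')

    # Each annotation still validates itself independently.
    for ann in jam['annotations']:
        validate_annotation(ann)
```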
@ejhumphrey what do you think?
I think I need time to digest this later, but first a question – are these mutually exclusive options?
> are these mutually exclusive options?
Nope, mostly just interested in feedback on whether there might be alternative routes to eliminating redundant serialization (short of caching), and what the best way to go about it might be.
On first read, both seem good, and I think doing (2) first, then (1) if needed... makes sense to me?
Ok, doing a light-weight serialization of `Annotation` (skipping the `data` field) shaves ~1s off my running example (52s -> 7.1s -> 6.2s).
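Concretely, the light-weight pass amounts to something like this (a sketch with plain dicts, not the actual patch):

```python
# Sketch of the light-weight generic pass: validate the annotation with its
# 'data' field emptied, so the observations are only serialized by the
# namespace-specific pass. Not the actual patch.
import jsonschema

def validate_annotation_lightweight(ann, annotation_schema):
    lightweight = dict(ann, data=[])  # keep the structure, skip the observations
    jsonschema.validate(lightweight, annotation_schema)
```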
Gonna try the light-weight serialization on `JAMS` now.
Bypassing the annotation array validation within `JAMS.validate()` brings the running example down to 3.31s. This seems acceptably fast to me, given where it started out.
Rolled in a bugfix for #171 that I noticed while implementing tests on the accelerated validation code.
I think this one's ready for CR.
One more benchmark, using the biggest file in the MedleyDB dataset: `AlexanderRoss_VelvetCurtain.jams` (24MB).
```
In [4]: %time J = jams.load('AlexanderRoss_VelvetCurtain.jams', validate=False)
CPU times: user 1.01 s, sys: 48 ms, total: 1.06 s
Wall time: 1.05 s

In [5]: %time J = jams.load('AlexanderRoss_VelvetCurtain.jams', validate=True)
CPU times: user 1min 35s, sys: 72 ms, total: 1min 35s
Wall time: 1min 35s

In [6]: %time J = jams.load('AlexanderRoss_VelvetCurtain.jams', validate=True)
CPU times: user 6.63 s, sys: 24 ms, total: 6.66 s
Wall time: 6.66 s
```
This PR accelerates schema validation, in particular for dense array data, following the thread over in #132 / #46 .
So far, the main improvement is due to vectorizing schema validation at the annotation level, rather than validating each observation independently. This provides somewhere between a 4x and 8x speedup without changing any functionality.
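To illustrate the idea (with a made-up field name and toy schema, not the actual namespace schemas): instead of one `jsonschema.validate` call per observation, the fields are collected across all observations and validated in a single call against an array-typed schema.

```python
# Toy illustration of vectorized validation vs. per-observation validation.
# The schema and field name are illustrative, not the real jams namespaces.
import jsonschema

VALUE_SCHEMA = {'type': 'number', 'minimum': 0.0, 'maximum': 1.0}

def validate_per_observation(observations):
    # Old approach: one jsonschema call per observation -- slow for dense data.
    for obs in observations:
        jsonschema.validate(obs['value'], VALUE_SCHEMA)

def validate_vectorized(observations):
    # Vectorized approach: one call validates all values at once.
    array_schema = {'type': 'array', 'items': VALUE_SCHEMA}
    jsonschema.validate([obs['value'] for obs in observations], array_schema)

if __name__ == '__main__':
    observations = [{'value': v / 1000.0} for v in range(1000)]
    validate_per_observation(observations)   # many validator invocations
    validate_vectorized(observations)        # a single invocation
```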