rapidsai / gpu-bdb

RAPIDS GPU-BDB
Apache License 2.0

Q27 slowdown at scale (possibly caused by spaCy warnings) #188

Closed beckernick closed 3 years ago

beckernick commented 3 years ago

Q27 at scale appears to be significantly slower than in the past.

spaCy 3.0 emits a lemmatization warning for every token when certain pipeline components (such as the tagger) are disabled. This means we may be printing hundreds of millions or even billions of warnings. If each call to Python's logging takes even 10 microseconds and each worker processes 100 million tokens, that adds up to about 1,000 seconds per worker.

10 microseconds feels like a lower bound on the time for a logging call. A quick test suggests each call could take about 200 microseconds:

import logging
logger = logging.getLogger('simple_example')
%timeit -n10 logger.warning("hello")
...
233 µs ± 15.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
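
To put the two per-call figures side by side, here is a rough back-of-the-envelope sketch. The 100 million tokens per worker is the same illustrative number used above, not a measured value, and the helper name `logging_overhead_seconds` is made up for this example.

```python
def logging_overhead_seconds(tokens_per_worker, microseconds_per_call):
    """Estimate the time one worker spends in logging calls.

    Assumes exactly one warning per token, which is the worst case
    described in this issue.
    """
    return tokens_per_worker * microseconds_per_call / 1_000_000


TOKENS_PER_WORKER = 100_000_000  # illustrative figure, not a measurement

# Optimistic guess: 10 microseconds per logging call.
optimistic = logging_overhead_seconds(TOKENS_PER_WORKER, 10)

# Rough measurement from the %timeit run: about 233 microseconds per call.
measured = logging_overhead_seconds(TOKENS_PER_WORKER, 233)

print(f"optimistic: {optimistic:.0f} s per worker")
print(f"measured:   {measured:.0f} s per worker ({measured / 3600:.1f} hours)")
```

At the measured rate the overhead would be roughly 6.5 hours per worker, so even if the real per-call cost sits somewhere between the two figures, logging alone could plausibly explain the slowdown.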

We should either disable the spaCy warnings or pin to spaCy 2.3.

See https://github.com/explosion/spaCy/issues/7033 for more information about how we can disable the spacy warnings.
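
One possible way to do this, sketched with only the standard library: raise the level of the logger that spaCy writes to so that warnings are dropped before any formatting or I/O happens. This assumes the warnings go through a logger named "spacy"; that name is an assumption here and should be checked against the linked issue and the installed spaCy version. The helper name `silence_logger` is made up for this example.

```python
import logging


def silence_logger(name="spacy", level=logging.ERROR):
    """Raise a logger's level so lower-severity records are dropped early.

    The default name "spacy" is an assumption about where the
    lemmatization warnings are emitted; adjust it if they come from
    somewhere else.
    """
    logger = logging.getLogger(name)
    logger.setLevel(level)
    return logger


# Call this once per worker process, before running the spaCy pipeline.
spacy_logger = silence_logger()

# Warnings are now filtered out; errors still get through.
print(spacy_logger.isEnabledFor(logging.WARNING))
print(spacy_logger.isEnabledFor(logging.ERROR))
```

Because the level check happens at the start of `logger.warning`, a filtered call should cost far less than one that actually emits a record. On a Dask cluster this would need to run on every worker, not just the client.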