Q27 at scale appears to be significantly slower than in the past.
spaCy 3.0 emits lemmatization warnings on a per-token basis when certain pipeline steps (such as the tagger) are disabled. This means we may be printing hundreds of millions or billions of warnings. If each call to Python's logging takes even 10 microseconds and each worker has 100 million tokens, that alone would add up to 1000 seconds per worker.
10 microseconds feels like a lower bound on the time for a logging call. A quick test suggests each call could take about 200 microseconds, which under the same assumptions would be about 20,000 seconds per worker.
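A measurement along these lines could be sketched as follows. This is only an illustration, not the exact test behind the 200 microsecond figure: the logger name `logging_cost_probe`, the message text, and both helper functions are hypothetical, and the result will vary with the handler, the output stream, and the machine.

```python
import logging
import sys
import time


def time_logging_call(n_calls=1_000, stream=None):
    """Return the mean wall-clock cost, in microseconds, of one logger.warning call."""
    # Hypothetical probe logger, kept separate from the root logger.
    logger = logging.getLogger("logging_cost_probe")
    logger.setLevel(logging.WARNING)
    logger.propagate = False
    handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
    logger.addHandler(handler)
    try:
        start = time.perf_counter()
        for i in range(n_calls):
            # Stand-in message; the real spaCy warning text is longer.
            logger.warning("lemmatizer warning for token %d", i)
        elapsed = time.perf_counter() - start
    finally:
        logger.removeHandler(handler)
    return elapsed / n_calls * 1e6


def projected_overhead_seconds(per_call_us, tokens_per_worker):
    """Extra seconds per worker if every token triggers one logging call."""
    return per_call_us * 1e-6 * tokens_per_worker


if __name__ == "__main__":
    per_call = time_logging_call()
    print(f"mean cost per call: {per_call:.1f} us")
    print(f"projected overhead: {projected_overhead_seconds(per_call, 100_000_000):.0f} s per worker")
```

Writing to a terminal, a file, or a captured notebook stream can change the per-call cost considerably, so the number is best read as an order-of-magnitude estimate.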
We should either disable the spaCy warnings or pin to spaCy 2.3.
See https://github.com/explosion/spaCy/issues/7033 for more information about how we can disable the spaCy warnings.
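One possible way to silence the warnings, using only the standard library, is sketched below. It assumes that spaCy reports the warning either through a logger named `spacy` or through Python's `warnings` module from code under the `spacy` package; the linked issue should be checked for the mechanism that actually applies to our version. The helper name `silence_spacy_warnings` is hypothetical.

```python
import logging
import warnings


def silence_spacy_warnings(logger_name="spacy"):
    """Suppress warning-level output attributed to spaCy (assumed mechanisms)."""
    # Assumption: spaCy logs through a logger with this name.
    logging.getLogger(logger_name).setLevel(logging.ERROR)
    # Assumption: otherwise the warning is raised via warnings.warn from a
    # module whose name starts with "spacy". This hides all such warnings,
    # not only the lemmatizer one.
    warnings.filterwarnings("ignore", module=logger_name)


if __name__ == "__main__":
    # Call this on each worker before running the pipeline.
    silence_spacy_warnings()
```

Because this hides every warning from the package rather than just the lemmatizer message, a narrower filter may be preferable once the exact warning is confirmed.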