Closed · konstantinmiller closed this issue 4 years ago
Depending on your input, you might want to pursue different strategies.
The obvious way to speed up tokenization would be to set the optional parameter `parallel` that is supported by all four tokenize methods (see the API documentation). This uses `multiprocessing` under the hood and will tokenize paragraphs (or XML chunks, for XML input) in parallel. Because of the overhead associated with multiprocessing, the chunksize is currently hard-coded to 250. As a result, you won't see any speed-up if your input has fewer than 250 paragraphs or XML chunks. I could make the chunksize available as a parameter, though. Still, this kind of parallelization is tailored towards a use case where your input texts are relatively large.
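The interaction between chunksize and input size can be illustrated with the standard library alone (a sketch using a thread pool purely for demonstration; the `which_worker` helper is made up here, and 250 simply mirrors the hard-coded chunksize, not anything SoMaJo-internal):

```python
from multiprocessing.pool import ThreadPool
import threading

def which_worker(_):
    # report which worker handled the item
    return threading.get_ident()

with ThreadPool(processes=4) as pool:
    # 100 items with chunksize=250 form a single chunk,
    # so only one of the four workers ever gets any work
    ids = pool.map(which_worker, range(100), chunksize=250)

print(len(set(ids)))  # 1
```

With 1000 items instead of 100, the input would be split into four chunks, which could then be spread across all four workers.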
If you have a large number of small input texts, it might be better to parallelize the calls to the respective tokenize method. A naive attempt at this will fail, though, because the tokenize methods return a generator object, which cannot be pickled. One possible solution is to wrap the call to the tokenize method in a function that converts the output to a list and to parallelize that (see the example below using `multiprocessing`).
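That generators cannot be pickled is easy to verify with the standard library (a minimal sketch, independent of SoMaJo):

```python
import pickle

def numbers():
    yield 1
    yield 2

try:
    pickle.dumps(numbers())
except TypeError as err:
    # generators hold execution state that pickle cannot serialize
    print("pickling a generator fails:", err)

# converting the generator to a list first works fine
roundtrip = pickle.loads(pickle.dumps(list(numbers())))
print(roundtrip)  # [1, 2]
```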
```python
#!/usr/bin/env python3

import multiprocessing

from somajo import SoMaJo


def tokenize_text(paragraphs):
    # convert the generator to a list so that the result can be pickled
    return list(tokenizer.tokenize_text(paragraphs))


def main():
    global tokenizer
    tokenizer = SoMaJo("de_CMC")
    paragraphs = ["der beste Betreuer?\n-- ProfSmith! : )",
                  "Was machst du morgen Abend?! Lust auf Film?;-)"]
    texts = [paragraphs] * 4  # four small texts of two paragraphs each
    with multiprocessing.Pool(processes=2) as pool:
        tokenized_texts = pool.imap(tokenize_text, texts)
        for tokenized_text in tokenized_texts:
            for sentence in tokenized_text:
                print(" ".join(token.text for token in sentence))


if __name__ == "__main__":
    main()
```
To answer the original question: Yes, SoMaJo is thread-safe, but you might run into trouble when trying to serialize generators, unless `joblib` is doing something smart. Luckily, it should be easy to work around that problem.
Thanks for the detailed answer! I hadn't noticed the `parallel` option. Sorry for that!
I have a few million documents with a few hundred characters each (mean is 60, max is 10,000). So, would you then suggest not to use the `parallel` option?
EDIT: Added flattening of the returned list after the call to `joblib`.
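The flattening step can be written as a nested comprehension or with `itertools.chain` (a stdlib-only sketch; `nested` stands in for the per-document sentence lists that the workers return):

```python
from itertools import chain

# one inner list of sentences per document, as returned by the workers
nested = [["Satz eins.", "Satz zwei."], ["Satz drei."]]

flat = [sentence for sentences in nested for sentence in sentences]
print(flat)  # ['Satz eins.', 'Satz zwei.', 'Satz drei.']

# equivalent, and often more readable:
assert flat == list(chain.from_iterable(nested))
```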
So, I just tried it out, and with the `parallel` option I'm getting around 10% to 20% CPU utilization on a 16-core machine (64 GB RAM), while parallelizing "by hand" gives close to 100% utilization. I'm doing
```python
from typing import Sequence

import joblib
from somajo import SoMaJo


class Tokenizer:
    _tokenizer = SoMaJo("de_CMC")

    @classmethod
    def _detect_sentences_worker(cls, doc: str) -> Sequence[str]:
        """Receive a document and return a list of sentences."""
        sentences = cls._tokenizer.tokenize_text([doc])
        return [" ".join(t.text for t in s) for s in sentences]

    @classmethod
    def detect_sentences(cls, docs: Sequence[str], n_jobs=-2) -> Sequence[str]:
        print(f"Detecting sentences in {len(docs)} documents")
        sentences_nested = joblib.Parallel(n_jobs=n_jobs)(
            joblib.delayed(cls._detect_sentences_worker)(doc) for doc in docs)
        # flatten the per-document lists into one list of sentences
        sentences = [sentence for sentences in sentences_nested
                     for sentence in sentences]
        print(f"Detected {len(sentences)} sentences. "
              f"That's on average {len(sentences) / len(docs):.1f} per document.")
        return sentences
```
In your example, you have all your documents in an iterable and treat each document as a single paragraph. In that case, I would expect that `tokenizer.tokenize_text(docs, parallel=14)` is roughly as efficient as your approach. Is that how you tried it? In general, if you can turn your documents into a stream of paragraphs, the `parallel` option should be able to process that stream efficiently. If that's not the case, I will have to take a closer look ;-).
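Turning many small documents into a single stream of paragraphs could look like this (a sketch; `docs` and the `paragraph_stream` helper are made up for illustration, and it is assumed that `tokenize_text` accepts any iterable of strings):

```python
def paragraph_stream(docs):
    # treat every document as one paragraph; a generator avoids
    # holding millions of strings in memory at once
    for doc in docs:
        yield doc

docs = ["Erster kurzer Text.", "Zweiter kurzer Text."]
print(list(paragraph_stream(docs)))
```

The stream could then be handed to `tokenizer.tokenize_text(paragraph_stream(docs), parallel=14)`; one thing to watch is that the resulting flat sentence stream no longer records which document a sentence came from.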
Ah, no, I guess I just did `parallel=True`, so that it effectively didn't parallelize. My bad.
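That would explain the lack of speed-up: `bool` is a subclass of `int` in Python, so `parallel=True` is most likely interpreted as `parallel=1`, i.e. a single worker (an assumption about how the parameter is read, not something confirmed by the SoMaJo docs):

```python
# True compares equal to 1 and is itself an int
print(isinstance(True, int))  # True
print(True == 1)              # True
# so a numeric worker-count parameter given True behaves like 1
```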
Is SoMaJo thread-safe? Can I, e.g., use it with `joblib` to parallelize operation?