tsproisl / SoMaJo

A tokenizer and sentence splitter for German and English web and social media texts.
GNU General Public License v3.0

Thread safety #14

Closed · konstantinmiller closed this issue 4 years ago

konstantinmiller commented 4 years ago

Is SoMaJo thread safe? Can I, e.g., use it with joblib to parallelize operation?

tsproisl commented 4 years ago

Depending on your input, you might want to pursue different strategies.

The obvious way to speed up tokenization would be to set the optional parameter parallel that is supported by all four tokenize methods (see the API documentation). This uses multiprocessing under the hood and will tokenize paragraphs (or XML chunks, for XML input) in parallel. Because of the overhead associated with multiprocessing, the chunksize is currently hard-coded to 250. As a result, you won't see any speed-up if your input has fewer than 250 paragraphs or XML chunks. I could make the chunksize available as a parameter, though. Still, this kind of parallelization is tailored towards a use case where your input texts are relatively large.
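
For illustration, a minimal sketch of that built-in parallelization (it reuses the de_CMC tokenizer and one of the example sentences from below; parallel=4 is an arbitrary number of processes):

from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC")
# Ideally, this iterable contains well over 250 paragraphs; otherwise the
# hard-coded chunksize prevents any speed-up.
paragraphs = ["Was machst du morgen Abend?! Lust auf Film?;-)"] * 1000
for sentence in tokenizer.tokenize_text(paragraphs, parallel=4):
    print(" ".join(token.text for token in sentence))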

If you have a large number of small input texts, it might be better to parallelize the calls to the respective tokenize method yourself. A naive attempt at that will fail, though, because the tokenize methods return generator objects, which cannot be pickled. One possible solution is to wrap the call to the tokenize method in a function that converts the output to a list and to parallelize that (see the example below using multiprocessing).

#!/usr/bin/env python3

import multiprocessing

from somajo import SoMaJo


def tokenize_text(paragraphs):
    # Convert the generator returned by SoMaJo into a list so that the
    # result can be pickled and sent back to the parent process.
    return list(tokenizer.tokenize_text(paragraphs))


def main():
    global tokenizer
    tokenizer = SoMaJo("de_CMC")
    paragraphs = ["der beste Betreuer?\n-- ProfSmith! : )",
                  "Was machst du morgen Abend?! Lust auf Film?;-)"]
    # Four small texts, two paragraphs each, so that the pool has several
    # tasks to distribute among its worker processes.
    texts = [paragraphs] * 4
    with multiprocessing.Pool(processes=2) as pool:
        tokenized_texts = pool.imap(tokenize_text, texts)
        for tokenized_text in tokenized_texts:
            for sentence in tokenized_text:
                print(" ".join(token.text for token in sentence))


if __name__ == "__main__":
    main()

tsproisl commented 4 years ago

To answer the original question: Yes, SoMaJo is thread-safe, but unless joblib does something smart, you might run into trouble when it tries to serialize the generator objects returned by the tokenize methods. Luckily, it should be easy to work around that problem.
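
For example, a minimal sketch of such a workaround with joblib (the helper name tokenize_as_lists is made up, and the sketch assumes that joblib's backend can ship the module-level tokenizer to the worker processes):

import joblib

from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC")

def tokenize_as_lists(paragraphs):
    # Materialize the generator inside the worker; plain lists of strings
    # can be pickled and returned to the parent process.
    return [[token.text for token in sentence]
            for sentence in tokenizer.tokenize_text(paragraphs)]

texts = [["Was machst du morgen Abend?! Lust auf Film?;-)"]] * 4
tokenized = joblib.Parallel(n_jobs=2)(
    joblib.delayed(tokenize_as_lists)(text) for text in texts)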

konstantinmiller commented 4 years ago

Thanks for the detailed answer! I hadn't noticed the parallel option. Sorry for that!

I have a few million documents with a few hundred characters each (the mean is 60, the max is 10,000). So, would you then suggest not using the parallel option?

konstantinmiller commented 4 years ago

EDIT: Added flattening of the returned list after the call to joblib.

So, I just tried it out, and with the parallel option I'm getting around 10% to 20% CPU utilization on a 16-core machine (64 GB RAM), while parallelizing "by hand" gives close to 100% utilization. I'm doing:

from typing import Sequence

import joblib

from somajo import SoMaJo


class Tokenizer:

    _tokenizer = SoMaJo("de_CMC")

    @classmethod
    def _detect_sentences_worker(cls, doc: str) -> Sequence[str]:
        """Receives a document and returns a list of sentences."""
        sentences = cls._tokenizer.tokenize_text([doc])
        sentences = [' '.join(t.text for t in s) for s in sentences]
        return sentences

    @classmethod
    def detect_sentences(cls, docs: Sequence[str], n_jobs=-2) -> Sequence[str]:
        print(f'Detecting sentences in {len(docs)} documents')
        sentences_nested = joblib.Parallel(n_jobs=n_jobs)(
            joblib.delayed(cls._detect_sentences_worker)(doc) for doc in docs)
        # Flatten the per-document sentence lists into a single list.
        sentences = [sentence for doc_sentences in sentences_nested for sentence in doc_sentences]
        print(f'Detected {len(sentences)} sentences. That\'s on average {len(sentences) / len(docs):.1f} per document.')
        return sentences
tsproisl commented 4 years ago

In your example, you have all your documents in an iterable and treat each document as a single paragraph. In that case, I would expect that tokenizer.tokenize_text(docs, parallel=14) is roughly as efficient as your approach. Is that how you tried it? In general, if you can turn your documents into a stream of paragraphs, the parallel option should be able to process that stream efficiently. If that's not the case, I will have to take a closer look ;-).
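
A minimal sketch of what that would look like (docs is assumed to be the iterable of document strings; parallel=14 leaves two of the 16 cores free):

from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC")
# Each document is passed as a single paragraph; SoMaJo distributes the
# paragraphs over 14 worker processes internally.
sentences = [" ".join(token.text for token in sentence)
             for sentence in tokenizer.tokenize_text(docs, parallel=14)]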

konstantinmiller commented 4 years ago

Ah, no, I guess I just did parallel=True so that it effectively didn't parallelize. My bad.