uqfoundation / pathos

parallel graph management and execution in heterogeneous computing
http://pathos.rtfd.io

Using pathos from within a function is slow #209

Closed — VolodyaCO closed this issue 3 years ago

VolodyaCO commented 3 years ago

I have the following issue, which I also explained on Stack Overflow, and will explain here:

I am trying to use pathos to trigger multiprocessing from within a function. However, I notice some odd behaviour and don't know why:

import spacy
from pathos.multiprocessing import ProcessPool as Pool

nlp = spacy.load("es_core_news_sm")

def preworker(text, nlp):
    return [w.lemma_ for w in nlp(text)]

worker = lambda text: preworker(text, nlp)

texts = ["Este es un texto muy interesante en español"] * 10

# Run this in jupyter:
%%time

pool = Pool(3)
r = pool.map(worker, texts)

The output is

CPU times: user 6.6 ms, sys: 26.5 ms, total: 33.1 ms
Wall time: 141 ms

So far so good... Now I define the exact same calculation, but inside a function:

def out_worker(texts, nlp):
    worker = lambda text: preworker(text, nlp)
    pool = Pool(3)
    return pool.map(worker, texts)

# Run this in jupyter:
%%time 

r = out_worker(texts, nlp)

The output now is

CPU times: user 10.2 s, sys: 591 ms, total: 10.8 s
Wall time: 13.4 s

Why is there such a large difference? My hypothesis, though I can't confirm it, is that in the second case a copy of the nlp object is sent to every single job.
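That hypothesis can be checked directly with dill (the serializer pathos uses): a function that closes over a large local object drags the whole object into its serialized payload, whereas a function that captures only a small object stays tiny. A minimal sketch, independent of spaCy (the million-element list is just a stand-in for a heavyweight model):

```python
import dill

def make_worker(obj):
    # returns a closure that captures `obj` in a cell; dill pickles
    # the cell contents by value along with the function
    return lambda x: (x, len(obj))

small_payload = len(dill.dumps(make_worker("hi")))
big_payload = len(dill.dumps(make_worker([0] * 1_000_000)))

# the closure over the large object is orders of magnitude bigger
print(small_payload, big_payload)
```

If the pool serializes the worker per task (or per worker), that size difference is paid repeatedly, which would explain seconds instead of milliseconds.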

Also, how can I correctly call this multiprocessing from within a function?

Thanks

mmckerns commented 3 years ago

See my comments here: https://stackoverflow.com/a/66808832/2379433

If this sufficiently answers your question, then close this ticket.

VolodyaCO commented 3 years ago

It answers the question. Thank you. I have also raised an issue in the p_tqdm project.