snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

PandasParallelLFApplier does not run in parallel when using Spacy as preprocessor #1662

Closed: hardianlawi closed this issue 2 years ago

hardianlawi commented 3 years ago

Issue description

PandasParallelLFApplier does not improve the speed when using spaCy as a preprocessing step.

Code example/repro steps

No parallelization

import spacy
import pandas as pd
from timeit import default_timer
from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.preprocess import preprocessor

nlp = spacy.load("en_core_web_sm")

@preprocessor(memoize=True)
def spacy_doc(x):
    # Returning the Doc directly makes it the data point handed to the labeling function.
    doc = nlp(x.text)
    return doc

@labeling_function(pre=[spacy_doc])
def greater_than_10(x):
    # x is the Doc produced by spacy_doc; the last token is the sentence number.
    if int(x[-1].text) > 10:
        return 1
    return 0

df = pd.DataFrame({"text": [f"This is sentence {num}" for num in range(2000)]})
applier = PandasLFApplier([greater_than_10])

start = default_timer()
L = applier.apply(df)
end = default_timer()

print(end - start) # 10.994037621989264 seconds

With parallelization

import spacy
import pandas as pd
from timeit import default_timer
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.apply.dask import PandasParallelLFApplier
from snorkel.preprocess import preprocessor

nlp = spacy.load("en_core_web_sm")

@preprocessor(memoize=True)
def spacy_doc(x):
    doc = nlp(x.text)
    return doc

@labeling_function(pre=[spacy_doc])
def greater_than_10(x):
    try:
        if int(x[-1].text) > 10:
            return 1
    except Exception as e:
        return 0
    return 0

df = pd.DataFrame({"text": [f"This is sentence {num}" for num in range(2000)]})
applier = PandasParallelLFApplier([greater_than_10])

start = default_timer()
L = applier.apply(df, n_parallel=4)
end = default_timer()

print(end - start) # 17.868044445000123

Expected behavior

I would expect PandasParallelLFApplier to improve the speed at least a little, even if not by much.

System info

Additional context

I think the problem is that the spaCy model is not shareable between processes(?). I would love to know a workaround if anyone has a solution (with or without parallelization).

hardianlawi commented 3 years ago

Ultimately, what I care about most is improving the speed. I'm wondering if it makes sense to have something like a BatchPandasLFApplier. The spaCy model provides an nlp.pipe method, which can boost speed by processing inputs in batches, but Snorkel currently isn't using it.
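
As a rough standalone illustration of why batching helps (outside Snorkel): calling nlp(...) once per document versus streaming the same texts through nlp.pipe. The batch_size value here is just an illustrative choice.

import spacy
from timeit import default_timer

nlp = spacy.load("en_core_web_sm")
texts = [f"This is sentence {num}" for num in range(2000)]

# One full pipeline invocation per text.
start = default_timer()
docs_one_by_one = [nlp(t) for t in texts]
print("per-document:", default_timer() - start)

# nlp.pipe streams the texts through the pipeline in batches.
start = default_timer()
docs_batched = list(nlp.pipe(texts, batch_size=256))
print("nlp.pipe:", default_timer() - start)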

vkrishnamurthy11 commented 3 years ago

@hardianlawi Thanks for sharing! Have you tried using the NLPLabelingFunction class? It is located here: https://github.com/snorkel-team/snorkel/blob/b3b0669f716a7b3ed6cd573b57f3f8e12bcd495a/snorkel/labeling/lf/nlp.py

hardianlawi commented 3 years ago

It doesn't seem to help.

import pandas as pd
from timeit import default_timer
from snorkel.labeling.lf.nlp import nlp_labeling_function
from snorkel.labeling.apply.dask import PandasParallelLFApplier

@nlp_labeling_function()
def greater_than_10(x):
    try:
        if int(x.doc[-1].text) > 10:
            return 1
    except Exception as e:
        return 0
    return 0

df = pd.DataFrame({"text": [f"This is sentence {num}" for num in range(2000)]})
applier = PandasParallelLFApplier([greater_than_10])

start = default_timer()
L = applier.apply(df, n_parallel=4)
end = default_timer()

print(end - start) # 18.304681275971234

hardianlawi commented 3 years ago

In the end, this is what I did, which gives almost a 10x boost. If you have a GPU, you can run spacy.prefer_gpu() before loading the nlp model (a GPU variant is sketched after the snippet below).

import spacy
import pandas as pd
from timeit import default_timer
from snorkel.labeling import LFApplier, labeling_function
from snorkel.preprocess import preprocessor

nlp = spacy.load("en_core_web_sm")

@labeling_function()
def greater_than_10(x):
    # Here x is a spaCy Doc yielded by nlp.pipe, not a DataFrame row.
    try:
        if int(x[-1].text) > 10:
            return 1
    except Exception as e:
        return 0
    return 0

df = pd.DataFrame({"text": [f"This is sentence {num}" for num in range(2000)]})
applier = LFApplier([greater_than_10])

start = default_timer()
L = applier.apply(nlp.pipe(df['text'].values))
end = default_timer()

print(end - start) # 1.2438313760212623
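
To spell out the GPU note above: a minimal sketch, assuming a CUDA-enabled spaCy install (CuPy available); the batch_size value is illustrative.

import spacy
import pandas as pd

df = pd.DataFrame({"text": [f"This is sentence {num}" for num in range(2000)]})

# Use the GPU if one is available; returns False and falls back to the CPU otherwise.
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")

# Stream the texts through the pipeline in batches.
docs = list(nlp.pipe(df["text"].values, batch_size=256))

Note that a small CPU-optimized pipeline like en_core_web_sm gains little from the GPU; transformer pipelines benefit much more.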

vkrishnamurthy11 commented 3 years ago

Interesting. Thanks for the insight!

bhancock8 commented 3 years ago

We've also attached the "feature request" and "help wanted" tags so others know that this (incorporating nlp.pipe to speed up spacy execution) is a change we'd be open to reviewing and merging if you or anyone else from the community contributes a PR for it.
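
For anyone picking this up, here is a minimal sketch of one possible interim approach that needs no applier changes: run nlp.pipe once over the text column, store the resulting Doc objects in a doc column, and have labeling functions read x.doc. The doc column name and the batch_size value are illustrative choices, not an existing Snorkel API.

import spacy
import pandas as pd
from snorkel.labeling import PandasLFApplier, labeling_function

nlp = spacy.load("en_core_web_sm")
df = pd.DataFrame({"text": [f"This is sentence {num}" for num in range(2000)]})

# Batch-preprocess once with nlp.pipe and attach the Docs as an ordinary column.
df["doc"] = list(nlp.pipe(df["text"].values, batch_size=256))

@labeling_function()
def greater_than_10(x):
    # x is the DataFrame row; x.doc is the precomputed spaCy Doc.
    try:
        return 1 if int(x.doc[-1].text) > 10 else 0
    except ValueError:
        return 0

applier = PandasLFApplier([greater_than_10])
L = applier.apply(df)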

hardianlawi commented 3 years ago

@bhancock8 Sure. I'll send a PR for this. In the meantime, I have created another PR to bump the tensorboard version, and I need someone to review it for me since I'm not sure why the CI is failing 🙏🏼.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.