Closed hardianlawi closed 2 years ago
Ultimately, the thing I care about most is improving the speed. I'm wondering if it makes sense to have something like a BatchPandasLFApplier. spaCy models provide an nlp.pipe method that can boost speed by processing inputs in batches, but currently Snorkel isn't using it.
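For reference, batched processing with nlp.pipe looks roughly like this. A minimal sketch: it uses spacy.blank("en") (a tokenizer-only pipeline, so no model download is needed) as a stand-in for a full model like en_core_web_sm, and the batch_size value is just an illustrative choice.

```python
import spacy

# Blank tokenizer-only pipeline; stands in for a full model such as en_core_web_sm
nlp = spacy.blank("en")

texts = [f"This is sentence {num}" for num in range(100)]

# One document at a time (roughly what an unbatched applier does)
docs_single = [nlp(t) for t in texts]

# Batched: nlp.pipe streams the texts through the pipeline in batches
docs_batched = list(nlp.pipe(texts, batch_size=32))

print(len(docs_batched))          # 100
print(docs_batched[5][-1].text)   # 5
```

The output documents are identical either way; only the throughput differs, and the gap grows with heavier pipeline components.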
@hardianlawi Thanks for sharing! Have you tried using the NLPLabelingFunction class? It is located here: https://github.com/snorkel-team/snorkel/blob/b3b0669f716a7b3ed6cd573b57f3f8e12bcd495a/snorkel/labeling/lf/nlp.py
It doesn't seem to help.
import pandas as pd
from timeit import default_timer

from snorkel.labeling.apply.dask import PandasParallelLFApplier
from snorkel.labeling.lf.nlp import nlp_labeling_function


@nlp_labeling_function()
def greater_than_10(x):
    try:
        if int(x.doc[-1].text) > 10:
            return 1
    except Exception:
        return 0
    return 0


df = pd.DataFrame({"text": [f"This is sentence {num}" for num in range(2000)]})
applier = PandasParallelLFApplier([greater_than_10])

start = default_timer()
L = applier.apply(df, n_parallel=4)
end = default_timer()
print(end - start)  # 18.304681275971234
In the end, this is what I did, which gives an almost 10x boost. If you have a GPU, you can run spacy.prefer_gpu() before loading the nlp model.
import spacy
import pandas as pd
from timeit import default_timer

from snorkel.labeling import LFApplier, labeling_function

nlp = spacy.load("en_core_web_sm")


@labeling_function()
def greater_than_10(x):
    # x is a spaCy Doc produced by nlp.pipe below
    try:
        if int(x[-1].text) > 10:
            return 1
    except Exception:
        return 0
    return 0


df = pd.DataFrame({"text": [f"This is sentence {num}" for num in range(2000)]})
applier = LFApplier([greater_than_10])

start = default_timer()
L = applier.apply(nlp.pipe(df["text"].values))
end = default_timer()
print(end - start)  # 1.2438313760212623
Interesting. Thanks for the insight!
We've also attached the "feature request" and "help wanted" tags so others know that this (incorporating nlp.pipe
to speed up spacy execution) is a change we'd be open to reviewing and merging if you or anyone else from the community contributes a PR for it.
@bhancock8 Sure. I'll send a PR for this. In the meantime, I have created another PR to bump up the tensorboard version and I need someone to review it for me as I'm not sure why the CI is failing 🙏🏼 .
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Issue description
PandasParallelLFApplier does not improve the speed when using spacy as a preprocessing step.
Code example/repro steps
No parallelization
With parallelization
Expected behavior
I would expect PandasParallelLFApplier to improve the speed at least a little bit, even if it's not much.
System info
Additional context
I think the problem is that the spacy model is not shareable between processes(?). I would love to know a workaround for this if anyone has a solution (with or without parallelization).
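One common workaround for the sharing problem (a sketch, not part of Snorkel's API): load the spaCy pipeline once per worker process via a multiprocessing Pool initializer, since spaCy pipelines are not picklable and cannot be sent to workers. The helper names here are hypothetical, and spacy.blank("en") stands in for spacy.load("en_core_web_sm") so the sketch runs without a model download.

```python
import multiprocessing as mp

import spacy

_nlp = None  # one pipeline per worker process


def _init_worker():
    # spaCy pipelines are not picklable, so each worker loads its own copy.
    global _nlp
    _nlp = spacy.blank("en")


def _label(text):
    # Same logic as greater_than_10 above, executed inside the worker
    doc = _nlp(text)
    try:
        return 1 if int(doc[-1].text) > 10 else 0
    except (ValueError, IndexError):
        return 0


def parallel_labels(texts, workers=2):
    with mp.Pool(workers, initializer=_init_worker) as pool:
        return pool.map(_label, texts)


if __name__ == "__main__":
    texts = [f"This is sentence {num}" for num in range(100)]
    print(sum(parallel_labels(texts)))  # 89: sentences 11..99 end in a number > 10
```

Loading the model in an initializer amortizes the load cost over all tasks in that worker; combining this with nlp.pipe inside each worker would batch within processes as well.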