snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Can't create duplicate SparkNLPLabelingFunctions #1499

Closed · rjurney closed this 4 years ago

rjurney commented 5 years ago

Issue description

I'm not sure whether this is intended, but I want to raise an issue I'm hitting while developing in the pyspark interactive shell (and likewise in Jupyter notebooks). When iterating on code, I find that if I recreate SparkNLPLabelingFunctions across multiple iterations, I have to restart Spark to do so. This can be pretty inconvenient when the session has a lot of computation in flight.

I'm not sure why this happens; I think I may be creating multiple instances of an LF with the same name. Are LFs intended to be immutable by name? Being able to reassign the LF for a given name seems desirable. Is this a problem with memoize? I'm not sure; I just want to report the problem and understand it.

Code example/repro steps

If I paste this into pyspark twice:

from snorkel.labeling.apply.spark import SparkLFApplier
from snorkel.labeling.lf.nlp_spark import SparkNLPLabelingFunction

#
# Create Label Functions (LFs) for tag search
#
ABSTAIN = -1

def keyword_lookup(x, keywords, label):
    # NB: `keywords` is iterated, so a bare string is matched character by
    # character; pass a list to match whole keywords.
    match = any(word in x._Doc.text for word in keywords)
    # print(keywords, match, label, x)
    if match:
        return label
    return ABSTAIN

def make_keyword_lf(keywords, label=ABSTAIN):
    return SparkNLPLabelingFunction(
        name=f"keyword_{keywords}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
        memoize=True,
        text_field='_Body',
        doc_field='_Doc'
    )

# A tag set is the tag string split on dashes ('-'), which aids the search (I think)
keyword_lfs = {}
for i, tag_set in enumerated_labels:  # (label, tag_set) pairs defined earlier (not shown)
    for tag in tag_set.split('-'):
        if tag not in keyword_lfs:
            keyword_lfs[tag] = make_keyword_lf(tag, label=i)

spark_applier = SparkLFApplier(list(keyword_lfs.values()))

test = label_encoded_df.limit(100).rdd

labels = spark_applier.apply(test)

I get this error:

In [85]: from snorkel.labeling.lf.nlp_spark import SparkNLPLabelingFunction

In [86]: keyword_lfs = {}
    ...: for i, tag_set in enumerated_labels:
    ...:
    ...:     for tag in tag_set.split('-'):
    ...:
    ...:         if tag not in keyword_lfs:
    ...:             keyword_lfs[tag] = make_keyword_lf(tag, label=i)
    ...:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-86-6ae5bbfff661> in <module>
      5
      6         if tag not in keyword_lfs:
----> 7             keyword_lfs[tag] = make_keyword_lf(tag, label=i)
      8

<ipython-input-80-3e53d7ea760b> in make_keyword_lf(keywords, label)
      4         f=keyword_lookup,
      5         resources=dict(keywords=keywords, label=label),
----> 6         memoize=False,
      7     )
      8

~/anaconda3/envs/weak/lib/python3.7/site-packages/snorkel/labeling/lf/nlp.py in __init__(self, name, f, resources, pre, text_field, doc_field, language, disable, memoize)
     81     ) -> None:
     82         self._create_or_check_preprocessor(
---> 83             text_field, doc_field, language, disable, pre or [], memoize
     84         )
     85         super().__init__(name, f, resources=resources, pre=[self._nlp_config.nlp])

~/anaconda3/envs/weak/lib/python3.7/site-packages/snorkel/labeling/lf/nlp.py in _create_or_check_preprocessor(cls, text_field, doc_field, language, disable, pre, memoize)
     64         elif parameters != cls._nlp_config.parameters:
     65             raise ValueError(
---> 66                 f"{cls.__name__} already configured with different parameters: "
     67                 f"{cls._nlp_config.parameters}"
     68             )

ValueError: SparkNLPLabelingFunction already configured with different parameters: SpacyPreprocessorParameters(text_field='_Body', doc_field='_Doc', language='en_core_web_sm', disable=None, pre=[], memoize=True)

Expected behavior

I would like to be able to re-run the same LF definitions in an interactive session without restarting Spark. In short, I would like to be able to repeat myself :)
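
Concretely (a hypothetical snippet, reusing the names from the repro above), this should be a harmless no-op:

keyword_lfs['python'] = make_keyword_lf('python', label=0)
keyword_lfs['python'] = make_keyword_lf('python', label=0)  # same parameters: should rebind, not raise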

System info

absl-py==0.8.0
argh==0.26.2
asn1crypto==0.24.0
astor==0.8.0
attrs==19.2.0
backcall==0.1.0
beautifulsoup4==4.8.1
bert-for-tf2==0.6.0
bleach==3.1.0
blessings==1.7
blis==0.4.1
boto==2.49.0
boto3==1.9.244
botocore==1.12.244
Bottleneck==1.2.1
bz2file==0.98
certifi==2019.9.11
cffi==1.12.3
chardet==3.0.4
Click==7.0
cloudpickle==1.2.2
configparser==4.0.2
cryptography==2.7
cupy-cuda100==6.4.0
cycler==0.10.0
cymem==2.0.2
cytoolz==0.9.0.1
dask==2.5.0
decorator==4.4.0
defusedxml==0.6.0
dill==0.3.1.1
docker-pycreds==0.4.0
docutils==0.15.2
en-core-web-sm==2.2.0
entrypoints==0.3
fastai==1.0.58
fastparquet==0.3.2
fastprogress==0.1.21
fastrlock==0.4
frozendict==1.2
fsspec==0.5.2
gast==0.2.2
gensim==3.8.1
gitdb2==2.0.6
GitPython==3.0.3
google-pasta==0.1.7
gpustat==0.6.0
gql==0.1.0
graphql-core==2.2.1
grpcio==1.24.1
h5py==2.10.0
idna==2.8
imageio==2.6.0
ipykernel==5.1.2
ipython==7.8.0
ipython-genutils==0.2.0
ipywidgets==7.5.1
iso8601==0.1.12
jedi==0.15.1
Jinja2==2.10.3
jmespath==0.9.4
joblib==0.14.0
jsonschema==3.0.2
jupyter==1.0.0
jupyter-client==5.3.3
jupyter-console==6.0.0
jupyter-core==4.5.0
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
llvmlite==0.30.0
lxml==4.4.1
Markdown==3.1.1
MarkupSafe==1.1.1
matplotlib==3.1.1
mistune==0.8.4
mkl-fft==1.0.14
mkl-random==1.1.0
mkl-service==2.3.0
mock==3.0.5
msgpack==0.6.1
msgpack-numpy==0.4.3.2
murmurhash==1.0.2
nbconvert==5.6.0
nbformat==4.4.0
nbstripout==0.3.6
networkx==2.3
nltk==3.4.5
notebook==6.0.1
numba==0.46.0
numexpr==2.7.0
numpy==1.17.2
nvidia-ml-py==375.53.1
nvidia-ml-py3==7.352.0
olefile==0.46
opt-einsum==3.1.0
packaging==19.2
pandas==0.25.1
pandocfilters==1.4.2
params-flow==0.7.0
parso==0.5.1
pathtools==0.1.2
patsy==0.5.1
pexpect==4.7.0
pickleshare==0.7.5
Pillow==6.2.0
pip-tools==4.1.0
plac==0.9.6
preshed==3.0.2
prometheus-client==0.7.1
promise==2.2.1
prompt-toolkit==2.0.10
protobuf==3.10.0
psutil==5.6.3
ptyprocess==0.6.0
py-params==0.6.4
py4j==0.10.7
pyarrow==0.14.1
pycparser==2.19
Pygments==2.4.2
pyOpenSSL==19.0.0
pyparsing==2.4.2
pyrsistent==0.15.4
PySocks==1.7.1
pyspark==2.4.4
python-dateutil==2.8.0
pytorch-lamb==1.0.0
pytz==2019.3
PyWavelets==1.0.3
PyYAML==5.1.2
pyzmq==18.1.0
qtconsole==4.5.5
regex==2019.8.19
requests==2.22.0
Rx==1.6.1
s3fs==0.3.5
s3transfer==0.2.1
sacremoses==0.0.35
scikit-image==0.15.0
scikit-learn==0.21.3
scipy==1.3.1
seaborn==0.9.0
Send2Trash==1.5.0
sentencepiece==0.1.83
sentry-sdk==0.12.3
shap==0.31.0
shortuuid==0.5.0
six==1.12.0
smart-open==1.8.4
smmap2==2.0.5
snorkel==0.9.3+dev
soupsieve==1.9.4
spacy==2.2.1
srsly==0.1.0
statsmodels==0.10.1
subprocess32==3.5.4
tensorboard==2.0.0
tensorboardX==1.9
tensorflow==2.0.0
tensorflow-estimator==2.0.0
tensorflow-gpu==2.0.0
tensorflow-hub==0.5.0
termcolor==1.1.0
terminado==0.8.2
testpath==0.4.2
textblob==0.15.3
texttable==1.6.2
thinc==7.1.1
thinc-gpu-ops==0.0.4
thrift==0.11.0
toolz==0.10.0
torch==1.1.0
torchvision==0.4.0
tornado==6.0.3
tqdm==4.36.1
traitlets==4.3.3
transformers==2.0.0
ujson==1.35
urllib3==1.25.6
wandb==0.8.12
wasabi==0.2.2
watchdog==0.9.0
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.16.0
widgetsnbextension==3.5.1
wrapt==1.11.2
henryre commented 5 years ago

Hi @rjurney, thanks for reporting!! [Spark]NLPLabelingFunction is designed so that all instances share a single cache: in the current implementation there is one SpacyPreprocessor shared among all [Spark]NLPLabelingFunctions, so every new [Spark]NLPLabelingFunction must be constructed with the same configuration as the first. In your case they all appear to have the same configuration, yet Snorkel thinks they don't.
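
For intuition, here is a minimal sketch of that class-level check, reconstructed from the traceback above rather than copied from Snorkel's source (names and details are illustrative):

class SharedConfigLF:
    _nlp_config = None  # one config shared by every instance of the class

    def __init__(self, **parameters):
        cls = type(self)
        if cls._nlp_config is None:
            # The first instantiation stores the parameters for everyone
            cls._nlp_config = parameters
        elif parameters != cls._nlp_config:
            # Any later instantiation must match them exactly
            raise ValueError(
                f"{cls.__name__} already configured with different "
                f"parameters: {cls._nlp_config}"
            )

One detail worth noting from the quoted traceback: the stored parameters show memoize=True, while the failing make_keyword_lf frame (ipython-input-80) passes memoize=False, so the two pastes may not have been byte-identical.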

Just want to double-check the setup: keyword_lfs has length greater than 1? And you're copying and pasting the exact code block above twice?

As a workaround, you can replicate the behavior of SparkNLPLabelingFunction by making new SpacyPreprocessor objects where appropriate and calling make_spark_preprocessor. See https://github.com/snorkel-team/snorkel/blob/v0.9.2/snorkel/labeling/lf/nlp_spark.py#L52.
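
For concreteness, here is an untested sketch of that workaround. It assumes the v0.9.x import paths snorkel.labeling.LabelingFunction, snorkel.preprocess.nlp.SpacyPreprocessor, and snorkel.preprocess.spark.make_spark_preprocessor, and reuses keyword_lookup and ABSTAIN from the repro above:

from snorkel.labeling import LabelingFunction
from snorkel.preprocess.nlp import SpacyPreprocessor
from snorkel.preprocess.spark import make_spark_preprocessor

def make_keyword_lf(keywords, label=ABSTAIN):
    # Build a fresh preprocessor per LF: no class-level state is shared,
    # so re-pasting this block doesn't trip the configuration check.
    pre = SpacyPreprocessor(text_field='_Body', doc_field='_Doc', memoize=True)
    make_spark_preprocessor(pre)  # adapt the preprocessor to Spark Rows
    return LabelingFunction(
        name=f"keyword_{keywords}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
        pre=[pre],
    )

The trade-off is that each LF now carries its own spaCy pipeline and memoization cache instead of sharing one, so model-load time and memory use grow with the number of LFs.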

github-actions[bot] commented 4 years ago

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.