Closed: rjurney closed this issue 4 years ago.
Hi @rjurney, thanks for reporting! `[Spark]NLPLabelingFunction` is designed so that all instances share a single cache: in the current implementation there's a single `SpacyPreprocessor` shared among all the `[Spark]NLPLabelingFunction`s, so any new `[Spark]NLPLabelingFunction` must have the same configuration. In your case, they appear to all have the same configuration, but Snorkel seems to think they don't.
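As a minimal sketch of that constraint (assuming Snorkel v0.9.x; the LF body and label values here are illustrative, not from this thread): the first `SparkNLPLabelingFunction` created fixes the shared `SpacyPreprocessor`'s configuration, and every later instance must match it.

```python
from snorkel.labeling.lf.nlp_spark import SparkNLPLabelingFunction

def has_person(x):
    # x.doc is the spaCy Doc the shared SpacyPreprocessor attaches to each row.
    return 1 if any(ent.label_ == "PERSON" for ent in x.doc.ents) else -1

# The first instance creates the single shared preprocessor (here with memoize=True).
lf_a = SparkNLPLabelingFunction(name="has_person", f=has_person, memoize=True)

# Same configuration: fine, this instance reuses the shared preprocessor and its cache.
lf_b = SparkNLPLabelingFunction(name="has_person_again", f=has_person, memoize=True)

# A different configuration (e.g. memoize=False, or another doc_field) raises
# an error saying the class is already configured with different parameters.
```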
Just want to double check the setup: `keyword_lfs` has length greater than 1? And you're just copying and pasting in the exact code block above twice?
As a workaround, you can replicate the behavior of `SparkNLPLabelingFunction` by making new `SpacyPreprocessor` objects where appropriate and calling `make_spark_preprocessor`. See https://github.com/snorkel-team/snorkel/blob/v0.9.2/snorkel/labeling/lf/nlp_spark.py#L52.
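A minimal sketch of that workaround, assuming Snorkel v0.9.x import paths (the `keyword_lookup` helper and label values are illustrative): build a fresh `SpacyPreprocessor`, adapt it to Spark `Row`s with `make_spark_preprocessor`, and attach it to a plain `LabelingFunction`, sidestepping the shared class-level configuration.

```python
from snorkel.labeling import LabelingFunction
from snorkel.preprocess.nlp import SpacyPreprocessor
from snorkel.preprocess.spark import make_spark_preprocessor

def keyword_lookup(x, keyword):
    # x.doc is the spaCy Doc added by the preprocessor attached below.
    return 1 if keyword in x.doc.text.lower() else -1

def make_keyword_lf(keyword):
    # A new SpacyPreprocessor object per LF, so no shared class-level
    # configuration check can fire when the block is redefined.
    preprocessor = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)
    make_spark_preprocessor(preprocessor)  # adapt the preprocessor to Spark Rows
    return LabelingFunction(
        name=f"keyword_{keyword}",
        f=keyword_lookup,
        resources=dict(keyword=keyword),
        pre=[preprocessor],
    )

keyword_lfs = [make_keyword_lf(k) for k in ["great", "terrible"]]
```

Because each call builds its own preprocessor object, re-pasting this block in the same `pyspark` session just creates fresh objects instead of tripping the class-level configuration check.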
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Issue description
I'm not sure whether this is intended, but I wanted to raise an issue I'm having while developing in the `pyspark` interactive shell as well as in Jupyter notebooks. When doing development work, I find that whenever I have to create different `SparkNLPLabelingFunction`s over multiple iterations on the code, I have to restart Spark to do so. This can be pretty inconvenient if your session has a lot of computation going on.

I'm not sure why this is. I think maybe I am creating multiple instances of an LF with the same name. Are they intended to be immutable by name? It seems desirable to be able to reassign the LF for a given name. Is this a problem with `memoize`? I'm not sure; I just want to report the problem and understand it.
Code example/repro steps
If I paste this into `pyspark` twice, I get this error:
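The original snippet and traceback are not shown here; purely as a hypothetical reconstruction of the shape being described (assuming Snorkel v0.9.x, with the keyword helper and keywords as illustrative stand-ins), the block would look roughly like:

```python
from snorkel.labeling.lf.nlp_spark import SparkNLPLabelingFunction

def keyword_lookup(x, keyword):
    # x.doc is the spaCy Doc added by the shared preprocessor.
    return 1 if keyword in x.doc.text.lower() else -1

# Builds more than one LF, matching the keyword_lfs mentioned above.
keyword_lfs = [
    SparkNLPLabelingFunction(
        name=f"keyword_{keyword}",
        f=keyword_lookup,
        resources=dict(keyword=keyword),
    )
    for keyword in ["great", "terrible"]
]
```

Pasting the same block a second time in one `pyspark` session then raises the configuration error, even though the parameters are unchanged.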
Expected behavior
I would like to be able to repeat myself :)
System info