snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Add snorkel.labeling.filter_unlabeled_rdd utility #1525

Open rjurney opened 4 years ago

rjurney commented 4 years ago

Is your feature request related to a problem? Please describe.

I love snorkel.labeling.filter_unlabeled_dataframe(). I would like a PySpark equivalent, such as snorkel.labeling.filter_unlabeled_spark_rdd or snorkel.labeling.filter_unlabeled_spark_dataframe.

Describe the solution you'd like

Implement the same filtering for pyspark.sql.DataFrame or pyspark.RDD objects.
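
A minimal sketch of what the RDD variant might look like. The name filter_unlabeled_rdd and its signature are hypothetical and not part of Snorkel. The sketch assumes that the rows of L are in the same order that zipWithIndex() assigns to the RDD, and that the set of kept indices is small enough to ship to the executors inside the closure.

```python
import numpy as np

ABSTAIN = -1  # Snorkel's convention for "labeling function abstained"


def filter_unlabeled_rdd(X, y, L):
    """Hypothetical RDD counterpart of filter_unlabeled_dataframe.

    X: RDD of data points, assumed to be in the same order as the rows of L
    y: array of shape [n, k] with one row of label probabilities per data point
    L: label matrix of shape [n, m]
    """
    # A data point is kept if at least one labeling function did not abstain.
    mask = (L != ABSTAIN).any(axis=1)
    keep = frozenset(np.flatnonzero(mask).tolist())

    # Pair each element with its position, keep the labeled positions,
    # then drop the index again.
    X_filtered = (
        X.zipWithIndex()
        .filter(lambda pair: pair[1] in keep)
        .map(lambda pair: pair[0])
    )
    return X_filtered, y[mask]
```

For a large L, broadcasting the kept indices or joining on the index would probably scale better than capturing the set in the closure.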

Describe alternatives you've considered

I am just implementing this myself at the moment. I don't see an alternative to this function.

Additional context

The numpy.ndarray returned by SparkLFApplier (for example, L_train) may have to be serialized into another format so that Spark can use it. SparkLFApplier could then optionally return that format, if that makes this easier.
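
One possible serialization, sketched under assumptions: each row of the label matrix becomes a (row_id, labels) tuple of plain Python ints, which Spark can infer a schema for. The helper name label_matrix_to_rows and the column names are hypothetical, and the sample matrix is illustrative only.

```python
import numpy as np


def label_matrix_to_rows(L):
    """Turn a label matrix into plain-Python (row_id, labels) tuples."""
    # Cast every entry to a built-in int, since NumPy scalar types are not
    # handled by Spark's schema inference.
    return [(i, [int(v) for v in row]) for i, row in enumerate(L)]


# Illustrative label matrix: 2 data points, 2 labeling functions.
L_train = np.array([[-1, 0], [1, -1]])
rows = label_matrix_to_rows(L_train)

# Given a SparkSession named `spark` (assumed, not created here), the rows
# could then be loaded with:
#   spark.createDataFrame(rows, ["row_id", "labels"])
```

Keeping an explicit row_id makes it possible to join the labels back onto the data points by index rather than relying on ordering.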

github-actions[bot] commented 4 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.