snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0
5.81k stars 857 forks

How do I use a continuous feature/variable/column of a dataset as a Labeling function in snorkel? #1600

Closed soumya-ranjan-sahoo closed 3 years ago

soumya-ranjan-sahoo commented 4 years ago

Hi Team,

I want to use a continuous, normalized column of my dataset as one of the signals for training my Snorkel model. How do I accomplish this? I can only see examples where categorical columns/variables are used in labeling functions. Does that mean I need to set some form of threshold for the continuous variable?

Thanks in advance!

bhancock8 commented 4 years ago

Yep, that's right! Labeling functions output a discrete label (or abstain), so if they're based on one or more continuous outputs, you'll need to map those into discrete labels somehow (such as with a threshold, bounding box, or other numeric comparison).
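As a minimal sketch of the thresholding idea, the function below maps a continuous score to a discrete vote and abstains in the uncertain middle range. The column name `score` and both cutoffs are hypothetical, and the label values follow Snorkel's convention of -1 for abstain. The function is plain Python so it runs without Snorkel installed; in a real pipeline it would presumably be wrapped with the `@labeling_function()` decorator from `snorkel.labeling`.

```python
from types import SimpleNamespace

# Label constants (Snorkel convention: -1 means abstain)
ABSTAIN = -1
NEGATIVE = 0
POSITIVE = 1

# Hypothetical cutoffs for a column normalized to [0, 1]
HIGH_THRESHOLD = 0.8
LOW_THRESHOLD = 0.2


def lf_score_threshold(x):
    """Turn a continuous score into a discrete vote, abstaining when unsure."""
    if x.score >= HIGH_THRESHOLD:
        return POSITIVE
    if x.score <= LOW_THRESHOLD:
        return NEGATIVE
    return ABSTAIN


# Stand-in data points; attribute access mirrors how a DataFrame row is read
rows = [SimpleNamespace(score=s) for s in (0.95, 0.5, 0.1)]
votes = [lf_score_threshold(r) for r in rows]
print(votes)  # [1, -1, 0]
```

Using two cutoffs rather than one is a design choice: it lets the labeling function stay silent on borderline values instead of forcing a vote.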

soumya-ranjan-sahoo commented 4 years ago

Alright, thanks for your feedback. How should I go about using a column that already has binary labels (0, 1) as a labeling function? An example would help :)

bhancock8 commented 4 years ago

I may be misunderstanding here, but if you have a column that already has binary labels, then you can basically just map those to labels in the labeling function. So for a problem with classes [A, B, C] and a column called "my_column" with binary values, you could write (in pseudocode): "lf(x): if x['my_column'] == 1 then label A, else abstain."
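The pseudocode above could be sketched in Python roughly as follows. The integer values assigned to classes A, B, and C are an assumption made for illustration, and the data points are plain dicts standing in for dataset rows. With Snorkel installed, the function would presumably be decorated with `@labeling_function()` from `snorkel.labeling`.

```python
# Label constants; the class-to-integer mapping is assumed for illustration
ABSTAIN = -1
A, B, C = 0, 1, 2


def lf_my_column(x):
    """Vote for class A when the binary column is set, otherwise abstain."""
    return A if x["my_column"] == 1 else ABSTAIN


# Plain dicts stand in for rows of the dataset
rows = [{"my_column": 1}, {"my_column": 0}, {"my_column": 1}]
votes = [lf_my_column(r) for r in rows]
print(votes)  # [0, -1, 0]
```

Note that a value of 0 in the column leads to an abstain here, not to a vote for another class.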

soumya-ranjan-sahoo commented 4 years ago

Okay! So in the above example, a data point on which the LF abstains doesn't get a label, right? Take a binary classifier with labels (1, 0), where positive = 1 and abstain = 0. If I define a labeling function as def LF1(x): return POSITIVE if x.TextLength > 10 else ABSTAIN, will I get 100% coverage with this labeling function? I ask because abstain would be 0 in my case, but in your example it isn't.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

bhancock8 commented 3 years ago

Ah, that's the difference. For a binary classification problem, your LFs will output one of {positive = 1, negative = 0, abstain = -1}, where abstaining means no label is added to the data point by that LF. And then your classifier will output one of {positive = 1, negative = 0}. Voting negative is different from abstaining from voting.
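To make the coverage difference concrete, here is a small sketch that compares the two variants of LF1 from the question. The `coverage` helper is a hypothetical stand-in for illustration (it is not Snorkel's own analysis tooling), and the sample `TextLength` values are made up.

```python
from types import SimpleNamespace

ABSTAIN = -1
NEGATIVE = 0
POSITIVE = 1


def lf_text_length(x):
    """Vote positive on long text, otherwise abstain (no vote at all)."""
    return POSITIVE if x.TextLength > 10 else ABSTAIN


def lf_text_length_votes_negative(x):
    """Same rule, but returning 0 casts a NEGATIVE vote instead of abstaining."""
    return POSITIVE if x.TextLength > 10 else NEGATIVE


def coverage(votes):
    """Fraction of data points on which the LF did not abstain."""
    return sum(v != ABSTAIN for v in votes) / len(votes)


# Made-up data points with a TextLength attribute
rows = [SimpleNamespace(TextLength=n) for n in (3, 12, 25, 7)]

abstaining_votes = [lf_text_length(r) for r in rows]
negative_votes = [lf_text_length_votes_negative(r) for r in rows]

print(abstaining_votes, coverage(abstaining_votes))  # [-1, 1, 1, -1] 0.5
print(negative_votes, coverage(negative_votes))      # [0, 1, 1, 0] 1.0
```

The second variant reaches 100% coverage only because every short text is being actively labeled negative, which is a much stronger claim than staying silent on it.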