snorkel-team / snorkel-tutorials

A collection of tutorials for Snorkel
https://www.snorkel.org/use-cases/
Apache License 2.0

Is it OK to have different features when creating labelling functions vs. training a classifier? #193

Closed: jennxf closed this issue 4 years ago

jennxf commented 4 years ago

I have a set of features for building labelling functions (set A) and another set of features for training an sklearn classifier (set B).

The generative model will output a set of probabilistic labels which I can use to train my classifier. Do I need to add the features (set A) that I used for the labelling functions into my classifier features (set B)?

I was referencing the Snorkel spam tutorial, and I did not see the features from the labelling functions being used to train the new classifier.

As seen in cell 47, featurization is done entirely using a CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(df_train_filtered.text.tolist())

X_dev = vectorizer.transform(df_dev.text.tolist())
X_valid = vectorizer.transform(df_valid.text.tolist())
X_test = vectorizer.transform(df_test.text.tolist())

And then it goes straight to fitting a Keras model:

# Define a vanilla logistic regression model with Keras
keras_model = get_keras_logreg(input_dim=X_train.shape[1])

keras_model.fit(
    x=X_train,
    y=probs_train_filtered,
    validation_data=(X_valid, preds_to_probs(Y_valid, 2)),
    callbacks=[get_keras_early_stopping()],
    epochs=50,
    verbose=0,
)
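For context, probs_train_filtered above comes from the label model step earlier in the tutorial, not from the classifier features. A minimal sketch of that step, assuming snorkel 0.9.x and the tutorial's lfs list and df_train:

from snorkel.labeling import PandasLFApplier, filter_unlabeled_dataframe
from snorkel.labeling.model import LabelModel

# Apply the labeling functions (the set A logic) to build the label matrix
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

# Learn LF accuracies and convert LF votes into probabilistic labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, seed=123)
probs_train = label_model.predict_proba(L=L_train)

# Drop examples no LF labeled; these pairs feed the Keras model above
df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(
    X=df_train, y=probs_train, L=L_train
)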
paroma commented 4 years ago

Good question! You do not need to add the features (set A) that you used for the LFs into the classifier features. To prevent the end model from simply overfitting to the labeling functions, it is better if the features for the LFs and for the end model (set A and set B) are as different as possible.
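One way to act on this advice: featurize the end model with a representation the keyword LFs never look at. A hedged sketch using character n-grams instead of word n-grams (the TfidfVectorizer settings are illustrative, not from the tutorial):

from sklearn.feature_extraction.text import TfidfVectorizer

# Character n-grams within word boundaries overlap far less with
# word-level keyword LFs than word uni/bi-gram counts do
char_vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X_train_char = char_vectorizer.fit_transform(df_train_filtered.text.tolist())
X_valid_char = char_vectorizer.transform(df_valid.text.tolist())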

yeamusic21 commented 4 years ago

If you have a labeling function that, say, checks whether the word 'channel' is present, and the model is a CountVectorizer feeding a logistic regression, isn't this textbook target leakage? It seems like you need to keep your set A features out of your model, and you need to be careful that they're not sneaking their way into your modeling features through design oversights, no?
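One hedged way to address that concern, assuming you keep a list of the keywords your LFs check (lf_keywords below is hypothetical, not from the tutorial), is to strip those tokens from the vectorizer's vocabulary:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical list of tokens consumed by keyword LFs (set A)
lf_keywords = ["channel", "my", "subscribe", "check"]

# stop_words drops these tokens before n-grams are built,
# keeping the set A signals out of the set B features
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words=lf_keywords)
X_train = vectorizer.fit_transform(df_train_filtered.text.tolist())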

yeamusic21 commented 4 years ago

Is this example not target leakage? Part of the definition of your label consists of specific bi-grams, and then your modeling features use almost the same logic, CountVectorizer(ngram_range=(1, 2)); the only difference is that you added a bit of noise when you combined your bi-gram function with some other functions, no?

This feels a bit like building an online checkout model where the target comes from Snorkel and any basket over $25 usually results in a checkout, and then, for my modeling features, I one-hot encode basket price using various thresholds. The definition of my target is basically being included in my modeling features, which seems rather pointless, no?

Unless your objective is to find the best way to aggregate your functions and then productionize the result. In that case the labeling functions are all you care about, and you could compute a simple max or average and act on that. But if you want to aggregate the functions in a way where consensus is optimal, then a generative model is great; productionizing it is not ideal, though, so you build a model on your generative model's output for production.
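The "simple max or average" baseline mentioned above exists in Snorkel as majority voting; a sketch contrasting it with the generative LabelModel, assuming snorkel 0.9.x and an existing label matrix L_train:

from snorkel.labeling.model import LabelModel, MajorityLabelVoter

# Simple aggregation: each LF gets one vote; ties abstain by default
majority_model = MajorityLabelVoter(cardinality=2)
preds_majority = majority_model.predict(L=L_train)

# Generative aggregation: learns LF accuracies and correlations
# from their overlaps and conflicts, with no gold labels needed
label_model = LabelModel(cardinality=2)
label_model.fit(L_train=L_train, n_epochs=500, seed=123)
probs_train = label_model.predict_proba(L=L_train)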

For reference, the LF and the featurization whose overlap is in question:

from sklearn.feature_extraction.text import CountVectorizer
from snorkel.labeling import labeling_function

ABSTAIN, SPAM = -1, 1  # label constants from the spam tutorial

@labeling_function()
def lf_keyword_my(x):
    """Many spam comments talk about 'my channel', 'my video', etc."""
    return SPAM if "my" in x.text.lower() else ABSTAIN

train_text = df_train_augmented.text.tolist()
X_train = CountVectorizer(ngram_range=(1, 2)).fit_transform(train_text)