snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0
5.81k stars 857 forks

Understanding the output of LabelModel and the code behind it #1615

Closed pratikchhapolika closed 3 years ago

pratikchhapolika commented 4 years ago

I was going through this document <https://www.snorkel.org/use-cases/01-spam-tutorial> and everything was clear until I came to sections 4 (Combining Labeling Function Outputs with the Label Model) and 5 (Training a Classifier).

from snorkel.labeling.model import LabelModel

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=123)

I have the following doubts. Please let me know if my understanding is correct.

  1. The input to LabelModel is a matrix of dimension total_training_samples × labeling_fn_output?

  2. The output of LabelModel is a matrix of probabilities with the same dimension as the input: total_training_samples × labeling_fn_prob?

  3. For every training sample, does it give a probability for each labeling function? If so, how would we know which class the probabilities for each labeling function and data point belong to?

from snorkel.labeling import filter_unlabeled_dataframe

df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(
    X=df_train, y=probs_train, L=L_train
)

  4. In the above code, what is "probs_train"? I cannot find its definition in the document.

from snorkel.utils import probs_to_preds

preds_train_filtered = probs_to_preds(probs=probs_train_filtered)

  5. In the above code, does probs_to_preds take the maximum of the probabilities across each row? What would the final values of preds_train_filtered be? Will it be an array of {0,1}?

  6. Where can I see the implementation of LabelModel, probs_to_preds, and PandasLFApplier?

fredsala commented 3 years ago

Hi Pratik,

Your understanding is correct for 1. and 2.

For 3, the output of the label model is not about each labeling function. The label model produces a probabilistic estimate of the true label. So, for example, for a dataset with 3 points and binary class, the output might be [[0.6, 0.4], [0.3, 0.7], [1.0, 0.0]].
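To make the "per class, not per labeling function" point concrete, here is a minimal sketch in plain Python. It is not what LabelModel does internally: it swaps in a simple normalized vote count as a stand-in, and the helper name vote_fractions is hypothetical. The only point is the shapes: the input has one column per labeling function, and the output has one column per class.

```python
ABSTAIN = -1  # Snorkel's convention for "this labeling function did not vote"

def vote_fractions(label_matrix, cardinality):
    """Hypothetical stand-in for a label model: per-class vote fractions per row."""
    probs = []
    for votes in label_matrix:
        counts = [0] * cardinality
        for v in votes:
            if v != ABSTAIN:
                counts[v] += 1
        total = sum(counts)
        if total == 0:
            # No labeling function voted: fall back to a uniform distribution.
            probs.append([1.0 / cardinality] * cardinality)
        else:
            probs.append([c / total for c in counts])
    return probs

# 3 data points, 4 labeling functions, binary task (classes 0 and 1)
L = [
    [0, 0, 1, -1],
    [1, -1, 1, 1],
    [-1, -1, -1, -1],
]
probs = vote_fractions(L, cardinality=2)

print(len(L), len(L[0]))          # rows x labeling functions
print(len(probs), len(probs[0]))  # rows x classes
```

Column j of each output row is the score for class j, so there is no ambiguity about which class a probability belongs to.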

For 4, take a look at the notebook that the tutorial refers to: https://github.com/snorkel-team/snorkel-tutorials/blob/master/spam/01_spam_tutorial.ipynb . In cell In [36], you'll find the definition of probs_train.
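For readers without the notebook open: assuming probs_train is the per-class probability matrix returned by the label model for L_train, the filter_unlabeled_dataframe call in the question drops the data points on which every labeling function abstained. A rough plain-Python sketch of that filtering step follows; the function name filter_unlabeled and the toy data are hypothetical, and this is an illustration rather than the library's code.

```python
ABSTAIN = -1  # value a labeling function returns when it does not vote

def filter_unlabeled(rows, probs, label_matrix):
    """Keep only the data points where at least one labeling function voted."""
    keep = [any(v != ABSTAIN for v in votes) for votes in label_matrix]
    rows_kept = [r for r, k in zip(rows, keep) if k]
    probs_kept = [p for p, k in zip(probs, keep) if k]
    return rows_kept, probs_kept

# Toy stand-ins for df_train, probs_train and L_train
rows = ["comment a", "comment b", "comment c"]
probs_example = [[0.6, 0.4], [0.5, 0.5], [0.1, 0.9]]
L_example = [
    [0, -1, 0],
    [-1, -1, -1],  # every labeling function abstained on this row
    [1, 1, -1],
]

rows_filtered, probs_filtered = filter_unlabeled(rows, probs_example, L_example)
print(rows_filtered)
print(probs_filtered)
```

The second row is removed from both outputs, so the remaining rows and probabilities stay aligned for classifier training.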

For 5, your understanding is correct. Note that it depends on the number of classes. If your task is binary (cardinality=2), then yes, you will get a binary output {0,1}.
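As a rough illustration of the row-wise maximum described in question 5, here is a plain-Python sketch. The helper argmax_rows is hypothetical and is not the library's probs_to_preds; in particular it resolves ties by taking the first maximum, which is an assumption made only to keep the example short.

```python
def argmax_rows(probs):
    """Return, for each row, the index of the largest probability (first one on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]

# Binary task: predictions are drawn from {0, 1}
binary_preds = argmax_rows([[0.6, 0.4], [0.3, 0.7], [1.0, 0.0]])
print(binary_preds)  # [0, 1, 0]

# Three-class task: predictions are drawn from {0, 1, 2}
multi_preds = argmax_rows([[0.1, 0.2, 0.7], [0.5, 0.3, 0.2]])
print(multi_preds)  # [2, 0]
```

With cardinality k, the predictions are integers in 0..k-1, so a binary task yields an array of 0s and 1s.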

For 6, you can find these here: Label Model: https://github.com/snorkel-team/snorkel/blob/master/snorkel/labeling/model/label_model.py

probs_to_preds method: https://github.com/snorkel-team/snorkel/blob/88b2579b8a4b22a6132f2e940a8a47949c73f9b8/snorkel/utils/core.py#L13

PandasLFApplier: https://github.com/snorkel-team/snorkel/blob/master/snorkel/labeling/apply/pandas.py#L51

Hope this helps!

paroma commented 3 years ago

Closing this for now, feel free to reopen if you have any other questions!

betiTG commented 2 years ago

@here, hello. I went through these explanations and they were useful, thanks, but they did not cover my issue. I want to know how the Label Model works for multi-class rather than binary problems. In my problem I have 7 classes and 13 labeling functions, and when I call the "fit" method it gives me this error:

"ValueError: L_train has cardinality 12, cardinality=7 passed in." I believe this is related to lines 890-894 of: https://github.com/snorkel-team/snorkel/blob/master/snorkel/labeling/model/label_model.py

Based on your explanations and the documentation, cardinality is the number of classes, but when the number of classes and the number of LFs differ, I get an error. My Snorkel version is 0.9.8, installed with pip on a Mac. Is there any more detailed documentation on multi-class labeling with Snorkel?

Thanks in advance,

humzaiqbal commented 2 years ago

Hi betiTG, Snorkel works for multi-class problems just as well as binary ones. This error message suggests that you initialized the label model for a problem with 7 classes, but the LF outputs you passed in contained 12. If your LFs can vote on 12 possible classes, you need to make sure the label model is initialized with cardinality=12 rather than 7. Hope this helps!
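One way to see how 7 classes can show up as "cardinality 12" is sketched below. It rests on an assumption, not on a reading of the Snorkel source: that the cardinality implied by a label matrix is its largest label value plus one (labels 0-based, -1 for abstain). The function names and the toy matrix are hypothetical.

```python
ABSTAIN = -1

def implied_cardinality(label_matrix):
    """Largest label value plus one, assuming 0-based labels and -1 for abstain."""
    return max(max(row) for row in label_matrix) + 1

def check_cardinality(label_matrix, cardinality):
    """Raise if the label matrix contains values outside 0..cardinality-1."""
    implied = implied_cardinality(label_matrix)
    if implied > cardinality:
        raise ValueError(
            f"label matrix implies cardinality {implied}, but cardinality={cardinality} was given"
        )

# Only 7 distinct classes are used, but they are numbered 0, 2, 4, 5, 7, 9, 11.
L_example = [
    [0, 2, -1],
    [4, 5, 4],
    [7, -1, 9],
    [11, 11, -1],
]

print(implied_cardinality(L_example))  # 12

try:
    check_cardinality(L_example, cardinality=7)
except ValueError as err:
    print(err)

check_cardinality(L_example, cardinality=12)  # passes: no exception raised
```

Under that assumption, either passing the larger cardinality or renumbering the classes contiguously from 0 makes the check pass.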

Best, Humza

Keramatfar commented 1 year ago

I guess your classes are not named 1...7, and you probably named them in such a way that the last class is named 13. If that is the case, just use a mapping dictionary.
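A minimal sketch of the mapping-dictionary idea, in plain Python. It assumes the abstain value is -1 and that the target numbering is 0..K-1; the helper names build_label_map and remap, and the raw label values, are hypothetical.

```python
ABSTAIN = -1

def build_label_map(label_matrix):
    """Map each distinct raw label to a contiguous index 0..K-1 (abstain excluded)."""
    distinct = sorted({v for row in label_matrix for v in row if v != ABSTAIN})
    return {raw: idx for idx, raw in enumerate(distinct)}

def remap(label_matrix, label_map):
    """Apply the mapping to every vote, leaving abstains untouched."""
    return [
        [ABSTAIN if v == ABSTAIN else label_map[v] for v in row]
        for row in label_matrix
    ]

# Raw labeling-function outputs using non-contiguous class ids
L_raw = [
    [1, 3, -1],
    [5, 5, 13],
    [-1, -1, 8],
]

label_map = build_label_map(L_raw)
L_clean = remap(L_raw, label_map)

print(label_map)  # {1: 0, 3: 1, 5: 2, 8: 3, 13: 4}
print(L_clean)    # [[0, 1, -1], [2, 2, 4], [-1, -1, 3]]
```

The inverse dictionary can then translate model predictions back to the original class names. Building the map from a fixed list of known classes, rather than from the observed votes, avoids missing a class that no labeling function happened to emit.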