snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

How the label model works #1719

Closed rkoystart closed 1 year ago

rkoystart commented 1 year ago

Hi, I was trying to understand how the label model works. I went through some videos on YouTube and came to the understanding that an accuracy is computed for every labelling function, and that these accuracies are in turn computed from the correlations between labelling functions, as said in this video.

All the accuracies and correlations are computable using the formulas mentioned in the above YouTube video (which I have reproduced below), so there are no learnable parameters involved here. But in the code for label_model.py I can see training being carried out. What are we trying to learn, since we can compute all the values directly?

    Accuracy of labelling function lambda_1:  E[lambda_1 Y] = sqrt( (E[lambda_1 lambda_2] * E[lambda_1 lambda_3]) / E[lambda_2 lambda_3] )
    Correlation of labelling functions i, j:  E[lambda_i lambda_j] = E[lambda_i Y] * E[lambda_j Y]

Here E[.] denotes a mean over the data points, e.g. E[lambda_i lambda_j] is the mean of the product of the two labelling functions' outputs.
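A quick numerical check of the triplet accuracy formula above (a sketch under assumed conditions: synthetic votes in {-1, +1}, no abstains, and conditionally independent labelling functions; all names here are illustrative, not Snorkel API):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
y = rng.choice([-1, 1], size=n)          # hidden true labels

def noisy_vote(acc):
    """An LF that agrees with y with probability acc (no abstains)."""
    return np.where(rng.random(n) < acc, y, -y)

# Three conditionally independent LFs with accuracies 0.9, 0.75, 0.6.
L = np.stack([noisy_vote(0.9), noisy_vote(0.75), noisy_vote(0.6)], axis=1)

def moment(i, j):
    """Observable pairwise moment E[lambda_i * lambda_j]."""
    return np.mean(L[:, i] * L[:, j])

# Triplet identity: E[lambda_0 Y] = sqrt(E[l0 l1] * E[l0 l2] / E[l1 l2]).
est = np.sqrt(moment(0, 1) * moment(0, 2) / moment(1, 2))
truth = np.mean(L[:, 0] * y)             # uses y, only for checking
print(est, truth)                        # both close to 2*0.9 - 1 = 0.8
```

Note that the estimate never touches `y`; only the pairwise agreement rates between LFs are used.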

Another thing I noticed while going through the code is that self.mu is the only learnable parameter, a matrix of dimension (number of labelling functions * cardinality) x cardinality. An explanation of what self.mu is, how it relates to the accuracy and correlation matrices, and what exactly it does would be useful for understanding the behaviour of the label model.

It would be helpful if there were some detailed documentation or information on how the label model works.

Correct me if any of my understanding is wrong.

fredsala commented 1 year ago

Thanks for the question!

There are multiple label models, depending on the type of modeling that is being done. Different label models have different learning techniques.

The first kind of label model you are referring to comes from here (https://arxiv.org/abs/2002.11955); it specifically models P(lambda=Y|Y=y), that is, the accuracy for each possible class of the true label. This is more coarse-grained, but it has the nice property that you have closed-form solutions based on the equations you described.

A richer model can learn P(lambda=a|Y=y), that is, the probability of each kind of error for each possible class of the true label. This is what we do in https://ojs.aaai.org/index.php/AAAI/article/view/4403, and the code that you are looking at implements this idea.

Here too, we have a set of algebraic relationships that must hold. SGD is used to learn the parameters that satisfy these relationships. These parameters (called mu) are exactly the P(lambda=a|Y=y) terms. This is also why we have the dimensionality that you mentioned: the mu vector has these probabilities for each labeling function, and for each of these, all the possible values of a and y, which gives the cardinality x cardinality dimensions.
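A sketch of that idea (not the actual label_model.py code; shapes, prior, and loss restricted to a mask are assumptions for illustration): the observable second-moment matrix O constrains mu through O ≈ mu P mu^T on entries corresponding to overlaps between *different* LFs, and gradient descent finds a mu satisfying those algebraic relationships.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 3, 2                          # 3 LFs, cardinality 2
P = np.diag([0.5, 0.5])              # class prior P(Y=y), assumed known

# Ground-truth mu: one k x k block per LF, block[a, y] = P(lambda=a | Y=y),
# so each block's columns sum to 1.
blocks = [rng.random((k, k)) for _ in range(m)]
mu_true = np.vstack([b / b.sum(axis=0) for b in blocks])

# Second-moment matrix; only overlaps between different LFs are observed,
# so the loss is restricted to off-block-diagonal entries via a mask.
O = mu_true @ P @ mu_true.T
mask = np.ones((m * k, m * k))
for i in range(m):
    mask[i*k:(i+1)*k, i*k:(i+1)*k] = 0.0

mu = rng.random((m * k, k))          # parameters to learn
losses = []
for _ in range(5000):
    R = mask * (O - mu @ P @ mu.T)   # masked residual (symmetric)
    losses.append(np.sum(R ** 2))
    mu += 0.01 * 4 * R @ mu @ P      # gradient step on ||R||_F^2
print(losses[0], losses[-1])         # loss drops as mu fits the moments
```

The real implementation adds further constraints (e.g. probabilities staying in [0, 1] and columns normalizing), but the core of the fit is this moment-matching objective.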

To recover a single-parameter description of accuracy from mu, we just need to sum up probabilities weighted by the priors, i.e., P(lambda=Y) = sum_y P(lambda=y|Y=y) P(Y=y). This just requires multiplying components of mu with the prior p and summing.
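Concretely, for one labelling function (the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical mu block for one LF with cardinality k = 2:
# mu_lf[a, y] = P(lambda = a | Y = y); columns sum to 1.
mu_lf = np.array([[0.8, 0.3],    # P(lambda=0|Y=0), P(lambda=0|Y=1)
                  [0.2, 0.7]])   # P(lambda=1|Y=0), P(lambda=1|Y=1)
p = np.array([0.6, 0.4])         # class prior P(Y=y)

# P(lambda = Y) = sum_y P(lambda=y | Y=y) P(Y=y):
# the diagonal of the block picks out the "correct vote" probabilities.
accuracy = np.sum(np.diag(mu_lf) * p)
print(accuracy)                  # 0.8*0.6 + 0.7*0.4 = 0.76
```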

rkoystart commented 1 year ago

Thanks @fredsala for the explanation. Can you please explain the terms a, y, Y, P(lambda=Y|Y=y), and P(lambda=a|Y=y), so that I and others who read this issue in the future have a better understanding while going through it.

And another doubt I have: should labelling functions always return only one label or abstain? Can't labelling functions return any possible label or abstain? For example, what I assume to be a wrong labelling function:

    def labelling_function_1(x):
        if x in list1:
            return 1
        elif x in list2:
            return 2
        else:
            return -1

And an example of what I assume to be a correct labelling function:

    def labelling_function_1(x):
        if x in list1:
            return 1
        elif x in list2:
            return 1
        else:
            return -1

I had gone through this paper https://ojs.aaai.org/index.php/AAAI/article/view/4403, which takes two things into consideration: (a) the observable part of the covariance matrix, which is normally of size (number of labelling functions) x (number of labelling functions); (b) the fact that the inverse of the covariance matrix will contain zero entries for any pair of labelling functions that are not dependent on each other.

But for point (b) above, you must have a dependency graph depicting the labelling functions' dependencies; only then will you know which pairs of labelling functions have inverse-covariance entries equal to 0. But in the code I see a graph being created without any edges. So how is the graph with the edges depicting the dependencies created?
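The zero pattern in point (b) can be sanity-checked on a tiny closed-form example (a sketch under assumed conditions: two binary LFs with votes in {-1, +1}, balanced prior P(Y=1) = 0.5, and conditional independence given Y, so every moment is known exactly):

```python
import numpy as np

a1, a2 = 0.9, 0.7                    # accuracies of the two LFs
c1, c2 = 2*a1 - 1, 2*a2 - 1          # E[lambda_i * Y] for balanced Y

# Covariance of (lambda_1, lambda_2, Y): all means are 0, variances are 1,
# and conditional independence gives E[l1 l2] = E[l1 Y] * E[l2 Y] = c1*c2.
Sigma = np.array([[1.0,   c1*c2, c1],
                  [c1*c2, 1.0,   c2],
                  [c1,    c2,    1.0]])

K = np.linalg.inv(Sigma)
print(K[0, 1])   # ~0: the independence of the LFs shows up as a zero
print(K[0, 2])   # nonzero: each LF does depend on Y
```

Of course, in practice Y is latent, so the full Sigma is not observable; the paper's point is that the known zero pattern of its inverse lets you recover the unobserved entries.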

fredsala commented 1 year ago

Happy to expand!

Terms:

Y is the (unobserved) true label of a data point, y is a particular value it can take, and a is a particular value a labeling function lambda can output. P(lambda=Y|Y=y) is the probability that the labeling function votes correctly when the true label has value y, i.e., its per-class accuracy.

The other probability is P(lambda=a|Y=y): the probability of predicting a particular value a (not necessarily the correct one) given that the true label has value y. This might include, for example, P(lambda=0|Y=1) or P(lambda=1|Y=-1).
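As a concrete (made-up) example of such a conditional table for one LF, using Snorkel-style labels where -1 denotes abstain and the classes are {0, 1}:

```python
import numpy as np

# cond[row, y] = P(lambda = a | Y = y) for vote a = votes[row];
# each column is a distribution over the LF's possible outputs.
votes = [-1, 0, 1]
cond = np.array([[0.30, 0.40],   # P(abstain | Y=0), P(abstain | Y=1)
                 [0.55, 0.15],   # P(vote 0  | Y=0), P(vote 0  | Y=1)
                 [0.15, 0.45]])  # P(vote 1  | Y=0), P(vote 1  | Y=1)

assert np.allclose(cond.sum(axis=0), 1.0)   # columns sum to 1
# e.g. P(lambda=0 | Y=1): voting class 0 when the truth is class 1.
print(cond[votes.index(0), 1])   # 0.15
```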

For your other questions: (1) depends on the type of labeling function. "Unipolar" LFs always vote a particular value a or abstain. "Multipolar" LFs can return more than one value (or abstain). Both kinds are permitted, but often the logic is simpler for unipolar LFs. Your examples show a multipolar then a unipolar LF.

(2) You are right---if you have access to a dependency graph, you would use that. These can also be learned from the data (see http://proceedings.mlr.press/v97/varma19a/varma19a.pdf). However, if this has not been done, it is always possible to ignore the dependencies by creating an empty graph. Learning dependencies will typically improve performance.

rkoystart commented 1 year ago

Thanks for the response @fredsala. Will reopen the issue in case I need any other info. :+1: