snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0
5.79k stars 859 forks

discussion on the class of '0' #1093

Closed zegzag closed 5 years ago

zegzag commented 5 years ago

I find that '0' has a special meaning in Snorkel: it indicates that an LF abstains. So I need to encode my categories starting from 1. But in many cases, '0' is itself used as one of the encoded categories. And since the column index of GenerativeModel.marginals(), which indicates the encoded category, starts from 0, things can get quite confusing.

For example, consider the following pipeline:

  1. I have a label matrix L from my LFs, with L.shape=(1000, 10) and L.unique()=[0,1,2,3,4]. That is, 10 LFs for a 4-class classification task, where '0' denotes abstention.
  2. I train the generative model: gen_model=GenerativeModel(), then gen_model.train(L).
  3. I get the marginal distribution Y_marginal=gen_model.marginals(). But Y_marginal.shape=(1000, 4) instead of (1000, 5).
  4. The index of the maximum probability in each row of Y_marginal denotes the most likely category, but it is off by one. So if I want to use the most likely category from Y_marginal as Y_label to train my discriminative model, I need Y_label=np.argmax(Y_marginal, axis=1)+1
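
The off-by-one shift in step 4 can be sketched with plain NumPy. This is only an illustration: the Y_marginal values below are made up, and the array stands in for whatever the generative model actually returns.

```python
import numpy as np

# Hypothetical marginals for 3 data points and 4 classes (made-up numbers,
# not output from a real GenerativeModel).
Y_marginal = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.80, 0.10],
    [0.10, 0.20, 0.10, 0.60],
])

# Column j holds the probability of class j + 1, because label 0 is
# reserved for abstention and has no column of its own.
Y_label = np.argmax(Y_marginal, axis=1) + 1

print(Y_label)  # [1 3 4]
```

Without the +1, the predicted labels would be [0 2 3], and the first data point would look like an abstention rather than class 1.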

So I suggest adding an explanation of this behavior to the Snorkel documentation and tutorials.

ajratner commented 5 years ago

@zegzag good suggestion, thank you for this detailed feedback! This is super valuable for us as we work on the v0.9 refactor coming this summer. I'll leave this open and ping again once we release that, to see if we can address this great feedback in v0.9.

vincentschen commented 5 years ago

Hi @zegzag — thanks again for the suggestion!! You might notice that in v0.9, we've changed the convention such that abstains are -1 labels. 🙌 Exactly to your point, we found that users were confusing categorical labels with our 0 convention.
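
For anyone carrying over a label matrix built under the old convention, a minimal sketch of the conversion might look like the following. It assumes the old matrix is a dense integer array using 0 for abstain and 1..K for classes, and that the target convention uses -1 for abstain and 0..K-1 for classes. The helper name to_minus_one_abstain is hypothetical and not part of Snorkel.

```python
import numpy as np

def to_minus_one_abstain(L_old):
    """Map {0: abstain, 1..K: classes} to {-1: abstain, 0..K-1: classes}.

    Hypothetical helper for illustration; not a Snorkel function.
    """
    L_old = np.asarray(L_old)
    if L_old.min() < 0:
        raise ValueError("expected old-style labels (0 = abstain, classes >= 1)")
    # A uniform shift by one moves abstain to -1 and class k to k - 1.
    return L_old - 1

# Made-up label matrix: 2 data points, 3 LFs, 4 classes.
L_old = np.array([[0, 1, 4],
                  [2, 0, 3]])
L_new = to_minus_one_abstain(L_old)

print(L_new)
# [[-1  0  3]
#  [ 1 -1  2]]
```

With classes numbered from 0, the argmax over the columns of a marginals array gives the class label directly, with no +1 adjustment.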

As you play around with the repo, please don't hesitate to open additional issues, start discussions on our forum, etc. — we really appreciate the feedback!