snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Labeling datasets without dev or test data #1539

Closed: stenpiren closed this issue 4 years ago

stenpiren commented 4 years ago

I see that labeling training data requires dev and test data.

Wouldn't it be better if we could create labels for the dataset without having to supply a dev or test set?

The reason is that in some domains the spam / not-spam style of example does not really apply. There are cases, especially in cybersecurity, where the strings do not represent natural language. Labeling them to produce a dev or test set is as much work as labeling the whole dataset, since you would need a combination of keywords representing each label.

In my current use case, I have more than 50 labels. Creating dev/test sets for all of them would be a nightmare, and I ask myself: isn't it better to just build keyword-based matching per label and apply it across the unlabeled data? If I instead label only a small subset and then expand, I will have even more work: writing labeling functions in Snorkel for 50 possible labels and running the necessary evaluations.

Did you think of such cases? How would you use Snorkel for this? Maybe Snorkel wouldn't work well for "tight" domains like cybersecurity.

regstrtn commented 4 years ago

Hi, I have a similar problem. I have 1400 categories, of which I want to focus on at least the top 30. I didn't understand what you meant by "keyword-based matching per label".

stenpiren commented 4 years ago

> Hi, I have a similar problem. I have 1400 categories, of which I want to focus on at least the top 30. I didn't understand what you meant by "keyword-based matching per label".

I mean, how else can you label a dataset with 50 labels to get the dev and test sets that Snorkel needs for evaluating the labeling functions?

In my use case I have zero labeled data, and the only way to get it labeled is to find possible keywords per category and quickly assign labels that way. The only problem is that a string matched to one category may also contain a keyword that should weigh more heavily toward a different category than the one originally intended.

What I am thinking of now is to create a set of possible keywords that can represent each category. Then, when matching against the examples, I would assign a label only when the remaining tokens in a string (those that did not match the category's keywords) do not occur in the keyword sets of other categories. If they do, I would probably ABSTAIN and do further analysis.
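A rough sketch of that idea in plain Python, assuming each category has its own keyword set and a string gets a label only when keywords from exactly one category match. The category ids, keywords, and the names `KEYWORDS`, `tokenize`, and `keyword_label` are all made up for illustration and are not part of Snorkel:

```python
import re

ABSTAIN = -1

# Hypothetical categories and keywords, purely for illustration.
KEYWORDS = {
    0: {"powershell", "encodedcommand"},  # e.g. "script execution"
    1: {"mimikatz", "lsass"},             # e.g. "credential access"
    2: {"nmap", "masscan"},               # e.g. "network scanning"
}


def tokenize(text):
    """Lowercase the string and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def keyword_label(text, keywords=KEYWORDS):
    """Return a category id, or ABSTAIN if zero or several categories match."""
    tokens = tokenize(text)
    matched = [label for label, words in keywords.items() if tokens & words]
    if len(matched) == 1:
        return matched[0]
    # No keyword hit, or keywords from more than one category: do not guess.
    return ABSTAIN


if __name__ == "__main__":
    for s in ["powershell -EncodedCommand AAAA",
              "powershell Invoke-Mimikatz",
              "ls -la"]:
        print(s, "->", keyword_label(s))
```

The strings that come back as ABSTAIN are the ones I would set aside for further analysis, for example to refine the keyword sets.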

In short, my main issue is how to make use of Snorkel when you have no dev or test sets and a very large number of categories.

vincentschen commented 4 years ago

Ground truth labels are not necessary for the core Snorkel algorithm to run! We use them primarily to demonstrate a standard machine learning pipeline.

I say a bit more about this on our forum — let me know if this is more helpful! https://spectrum.chat/snorkel/help/examples-for-unlabeled-data~f8d053ba-8e65-423f-bdc6-d1426e71ec5d?m=MTU4MTEwMjUyMDcxOQ==
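To illustrate the point that labeling function outputs can be combined and inspected with no ground truth at all, here is a minimal standard-library sketch. It uses a plain majority vote as a stand-in, which is much simpler than Snorkel's learned label model, and the functions `majority_vote`, `coverage`, and `conflict_rate` plus the small label matrix are assumptions made up for this example:

```python
from collections import Counter

ABSTAIN = -1


def majority_vote(votes):
    """Combine one row of labeling-function votes; abstain on ties or no votes."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # top two labels are tied
    return ranked[0][0]


def coverage(label_matrix):
    """Fraction of examples that received at least one non-abstain vote."""
    covered = sum(1 for row in label_matrix if any(v != ABSTAIN for v in row))
    return covered / len(label_matrix)


def conflict_rate(label_matrix):
    """Fraction of examples where non-abstaining votes disagree."""
    conflicted = sum(
        1 for row in label_matrix
        if len({v for v in row if v != ABSTAIN}) > 1
    )
    return conflicted / len(label_matrix)


# Rows are examples, columns are three hypothetical labeling functions.
L = [
    [0, 0, ABSTAIN],
    [1, ABSTAIN, 1],
    [ABSTAIN, ABSTAIN, ABSTAIN],
    [0, 2, ABSTAIN],
]

if __name__ == "__main__":
    print([majority_vote(row) for row in L])
    print(coverage(L), conflict_rate(L))
```

None of these quantities needs a dev or test set. Coverage and conflicts show where the labeling functions are silent or disagree, which is a reasonable place to start iterating on them.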

github-actions[bot] commented 4 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.