snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Create utility functions for keyword labeling functions #1579

Closed rjurney closed 4 years ago

rjurney commented 4 years ago

Keyword Labeling Function Utilities

The Spam tutorial has the following code for creating keyword Labeling Functions, which are the most common (and surprisingly powerful) type of LF:

from snorkel.labeling import LabelingFunction

# Label values as defined in the Spam tutorial
ABSTAIN = -1
SPAM = 1

def keyword_lookup(x, keywords, label):
    if any(word in x.text.lower() for word in keywords):
        return label
    return ABSTAIN

def make_keyword_lf(keywords, label=SPAM):
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
    )
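
To make the matching behaviour concrete, here is a minimal sketch that calls keyword_lookup directly, without Snorkel. It assumes the tutorial's label values (ABSTAIN = -1, SPAM = 1) and uses SimpleNamespace as a stand-in for a data point with a .text attribute; the sample comments are made up for illustration.

```python
from types import SimpleNamespace

# Assumed label values, matching the Spam tutorial
ABSTAIN = -1
SPAM = 1


def keyword_lookup(x, keywords, label):
    # Substring test against the lowercased text, as in the tutorial snippet
    if any(word in x.text.lower() for word in keywords):
        return label
    return ABSTAIN


# SimpleNamespace stands in for a data point exposing a .text attribute
hit = SimpleNamespace(text="Please SUBSCRIBE to my channel")
miss = SimpleNamespace(text="Great song")
partial = SimpleNamespace(text="I love this army documentary")

print(keyword_lookup(hit, ["subscribe"], SPAM))   # keyword present, case-insensitive
print(keyword_lookup(miss, ["subscribe"], SPAM))  # no keyword, so the LF abstains
print(keyword_lookup(partial, ["my"], SPAM))      # "my" also matches inside "army"
```

Because the check is a plain substring test, short keywords can fire inside longer words, and the keywords themselves are expected to be lowercase already.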

There should be an interface for a KeyWordLabelingFunction that incorporates this capability, as it is the most common usage pattern.

Describe the solution you'd like

I have made improvements to this code to have the option of searching one or more fields, and for creating separate LFs per word:

from snorkel.labeling import LabelingFunction

# Label value as defined in the Spam tutorial
ABSTAIN = -1

def keyword_lookup(x, keywords, field, label):
    """Given a list of tuples, look for any of a list of keywords"""
    if field in x and x[field] and any(word.lower() in x[field].lower() for word in keywords):
        return label
    return ABSTAIN

def make_keyword_lf(keywords, field, label=ABSTAIN, separate=False):
    """Given a list of keywords and a label, return a keyword search LabelingFunction"""
    prefix = 'separate_' if separate else ''
    name = f'{prefix}{keywords[0]}_field_{field}'
    return LabelingFunction(
        name=name,
        f=keyword_lookup,
        resources=dict(
            keywords=keywords,
            field=field,
            label=label,
        ),
    )

def make_keyword_lfs(keywords, fields, label=ABSTAIN, separate=False):
    """Given a list of keywords and fields, make one or more LabelingFunctions for the keywords with each field"""
    lfs = []
    for field in fields:

        # Optionally make one LF per keyword
        if separate:
            for keyword in keywords:
                lfs.append(
                    make_keyword_lf(
                        [keyword],
                        field,
                        label=label,
                        separate=separate,
                    )
                )
        # Otherwise group all keywords into a single LF for each field
        else:
            lfs.append(
                make_keyword_lf(
                    keywords,
                    field,
                    label=label,
                )
            )
    return lfs
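
One edge case worth sketching: the guard `field in x and x[field]` skips missing and empty fields, but a NaN value (common in pandas rows) is truthy and has no .lower(). Below is a hypothetical variant, keyword_lookup_safe, that abstains on any non-string value. It is not part of Snorkel. It assumes x is dict-like (a dict or a pandas Series, both of which provide .get), and SOFTWARE is a made-up label for illustration.

```python
ABSTAIN = -1  # assumed, as in the Spam tutorial
SOFTWARE = 1  # hypothetical label, for illustration only


def keyword_lookup_safe(x, keywords, field, label):
    """Abstain when the field is missing or is not a string (None, NaN, ...)."""
    value = x.get(field)  # dict-like access; returns None if the field is absent
    if not isinstance(value, str):
        return ABSTAIN
    text = value.lower()
    if any(word.lower() in text for word in keywords):
        return label
    return ABSTAIN


# Plain dicts stand in for rows; float("nan") mimics a missing pandas value
rows = [
    {"name": "Acme SDK", "description": None},
    {"name": float("nan"), "description": "A program for editing photos"},
    {"description": "Bakery"},
]

votes = [
    [keyword_lookup_safe(row, ["sdk", "program"], field, SOFTWARE)
     for field in ("name", "description")]
    for row in rows
]
print(votes)  # one [name_vote, description_vote] pair per row
```

The same function could be passed as f= to LabelingFunction in make_keyword_lf, since it keeps the (x, keywords, field, label) signature.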

Therefore I propose the function snorkel.labeling.lf.nlp.keyword_labeling_functions with the following interface:

snorkel.labeling.lf.nlp.keyword_labeling_functions(
    fields=['name', 'description'],
    terms=['sdk', 'software', 'program'],
    separate=False,
)

# One LF per field, matching any of the terms. Note: separate=False here.
# Returns:
# [
#     LabelingFunction(name='name_sdk_software_program'),
#     LabelingFunction(name='description_sdk_software_program'),
# ]

If separate=True, a new LF is created for each term; otherwise the terms are combined with OR within a single LF. The function could then take a method argument (one of ['or', 'and']) to enable multiple-phrase matching.
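
A rough sketch of how the 'or' / 'and' switch might work inside the lookup function, kept free of Snorkel so it runs on its own. The method parameter and the dict-like data point are assumptions for illustration, not an existing Snorkel API.

```python
ABSTAIN = -1  # assumed, as in the Spam tutorial
SOFTWARE = 1  # hypothetical label, for illustration only


def keyword_lookup(x, keywords, field, label, method="or"):
    """Match any keyword (method='or') or require all of them (method='and')."""
    if method not in ("or", "and"):
        raise ValueError(f"method must be 'or' or 'and', got {method!r}")
    value = x.get(field)  # assumes a dict-like data point
    if not isinstance(value, str) or not keywords:
        return ABSTAIN  # nothing to search, or nothing to search for
    text = value.lower()
    combine = any if method == "or" else all
    if combine(word.lower() in text for word in keywords):
        return label
    return ABSTAIN


row = {"description": "An SDK and software toolkit"}

print(keyword_lookup(row, ["sdk", "program"], "description", SOFTWARE, method="or"))
print(keyword_lookup(row, ["sdk", "program"], "description", SOFTWARE, method="and"))
print(keyword_lookup(row, ["sdk", "software"], "description", SOFTWARE, method="and"))
```

The explicit check for an empty keyword list matters in 'and' mode, because all() over an empty sequence is True and would otherwise label every row.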

I don't know if this is the right interface but this seems the right method.

Describe alternatives you've considered

This is the only alternative I can think of without disruptive changes to the LabelingFunction interface.

Additional context

I use this code enough that I'd be happy to write the patch.

vincentschen commented 4 years ago

These are great suggestions!

Would you mind making these contributions, potentially as LF generators, in a pull request to https://github.com/snorkel-team/snorkel-zoo?

rjurney commented 4 years ago

@vincentschen sure!

ajratner commented 4 years ago

@rjurney thanks for this! Excited here!