snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Have a handle on the rest of the data from LFs & preprocessors #1573

Closed Wirg closed 4 years ago

Wirg commented 4 years ago

Is your feature request related to a problem? Please describe.

⚠️ I am using PandasLFApplier, and this might be related to the way I use it. ⚠️ This is not my real use case (I am working on a CV task), but an NLP one that I think clarifies my issue.

Each datapoint represents a sentence that might be true or false. To know if this sentence is true or false, I may want to use a trust indicator from :

To do that, I have to either duplicate the whole dataset in each row, or make the data passed to the applier globally accessible at the same level as all the LFs & preprocessors.

import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.preprocess import preprocessor

TRUE = 1
FALSE = 0
ABSTAIN = -1

DATA = pd.DataFrame([
    {"sentence": "I am convinced Illuminatis did it.", "author": "Me"},
    {"sentence": "Earth is flat.", "author": "Me"},
    {"sentence": "Sun is shining.", "author": "You"},
    {"sentence": "Henri is having fun.", "author": "You"},
])

@preprocessor()
def get_same_author_sentences(sentence):
    sentence.same_author_sentences = (
        DATA
        # Keep only the *other* sentences; the column is named "sentence", not "text".
        .loc[lambda df: df.sentence != sentence.sentence]
        .loc[lambda df: df.author == sentence.author]
    )
    return sentence

@labeling_function(pre=[get_same_author_sentences])
def are_other_sentences_from_the_same_author_speaking_about_illuminati(sentence):
    if sentence.same_author_sentences.sentence.str.lower().str.contains("illuminati").any():
        return FALSE
    return ABSTAIN

applier = PandasLFApplier([
    are_other_sentences_from_the_same_author_speaking_about_illuminati
])
predictions = applier.apply(DATA)

In this case, I cannot properly reuse the applier, because whatever is passed to applier.apply has to be DATA.

Describe the solution you'd like

Being able to pass metadata in the apply call, which could then be used by the different mappers.

@preprocessor()
def get_same_author_sentences(sentence, full_data):
    sentence.same_author_sentences = (
        full_data
        # Keep only the *other* sentences; the column is named "sentence", not "text".
        .loc[lambda df: df.sentence != sentence.sentence]
        .loc[lambda df: df.author == sentence.author]
    )
    return sentence

...

predictions = applier.apply(input_data, full_data=input_data)

Describe alternatives you've considered

I have considered:

def apply_on_dataframe(df):
    DATA = df
    # Define the LFs and preprocessors here so that they close over DATA,
    # then build the applier from them.
    return applier.apply(DATA)

brahmaneya commented 4 years ago

Hi @Wirg,

You could try using the resources property of labeling functions to pass DATA in to each LF, and compute the same_author_sentences field within the LF.

Wirg commented 4 years ago

Hi @brahmaneya,

Thank you for your answer. I am not sure how to use that field. I have seen examples where it is used to insert a knowledge base, for example in this tutorial:

https://github.com/snorkel-team/snorkel-tutorials/blob/d11063c7d5f8b235da1b24820da0ce50bf105c5d/spouse/spouse_demo.py#L98-L103

What I would like is to use data that I pass in when I call applier.apply(input_data); this data would be call-specific.
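
To make "call-specific" concrete, here is a minimal sketch of the pattern in plain Python. It deliberately does not use Snorkel: `apply_with_context` and `lf_same_author_mentions_illuminati` are hypothetical names made up for illustration, and rows are plain dicts rather than a DataFrame. The only idea shown is that the LF takes the full data as an extra argument, and that argument is bound at apply time instead of at definition time.

```python
from functools import partial

FALSE = 0
ABSTAIN = -1


def lf_same_author_mentions_illuminati(row, full_data):
    """Vote FALSE if another sentence by the same author mentions 'illuminati'."""
    for other in full_data:
        if (
            other["author"] == row["author"]
            and other["sentence"] != row["sentence"]
            and "illuminati" in other["sentence"].lower()
        ):
            return FALSE
    return ABSTAIN


def apply_with_context(rows, lf_templates):
    """Hypothetical helper: bind this call's rows to every LF, then apply them."""
    bound_lfs = [partial(lf, full_data=rows) for lf in lf_templates]
    # One list of votes per row, one vote per LF.
    return [[lf(row) for lf in bound_lfs] for row in rows]


rows = [
    {"sentence": "I am convinced Illuminatis did it.", "author": "Me"},
    {"sentence": "Earth is flat.", "author": "Me"},
    {"sentence": "Sun is shining.", "author": "You"},
    {"sentence": "Henri is having fun.", "author": "You"},
]
labels = apply_with_context(rows, [lf_same_author_mentions_illuminati])
```

Because nothing is bound globally, the same LF definitions can be reused on a different input in the next call. With Snorkel itself, a comparable workaround would presumably be to build the LFs (for instance via resources) and a fresh applier inside such a helper on every call; that is an assumption on my side, not something confirmed in this thread.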

github-actions[bot] commented 4 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.