snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Using a binary classifier as a custom label function #1668

Closed cuent closed 3 years ago

cuent commented 3 years ago

Issue description

I built a binary classifier and created a LabelingFunction that assigns a label when the prediction score is above a threshold. I load the model in a preprocessing step, where I run inference with a TensorFlow model and transformers. However, when I execute PandasParallelLFApplier, either the model is loaded several times and I run out of memory (if the inference happens inside get_label), or pickling fails with TypeError: can't pickle _thread.RLock objects (if the inference happens outside get_label, as in the example below).

models = {...}

ABSTAIN = -1

def get_label(x, label_id):
    # Emit the label only if the preprocessed prediction score clears the threshold.
    if label_id and x > 0.8:
        return label_id
    return ABSTAIN

LabelingFunction(
    name=f"lf_binary_clf_{label_id}",
    f=get_label,
    resources=dict(label_id=label_id),
    pre=[lambda x: models[label_id].predict(encode(tokenizer, [x.payload]))],
)

If I want to use a classifier inside an LF, what is the proper way to do inference?

vkrishnamurthy11 commented 3 years ago

@cuent Thanks for letting us know. We are taking a look at this and will get back to you ASAP.

vkrishnamurthy11 commented 3 years ago

@cuent At the moment, we don't support this feature. Perhaps you can try a workaround: use the model to create a list or dict of predictions beforehand, and then look up the appropriate value inside the LF.
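For reference, a minimal sketch of that workaround, assuming a pandas DataFrame df with a unique id column, a payload text field, and a fitted classifier clf with a predict_proba method (df, clf, and the column names are hypothetical, not taken from this thread):

from snorkel.labeling import PandasLFApplier, labeling_function

ABSTAIN, POSITIVE = -1, 1

# Run inference once, up front, and key the scores by example id so the
# model itself never has to be pickled or reloaded by the applier.
scores = {row.id: clf.predict_proba([row.payload])[0][1] for row in df.itertuples()}

@labeling_function(resources=dict(scores=scores))
def lf_binary_clf(x, scores):
    # Look up the precomputed score instead of calling the model here.
    return POSITIVE if scores[x.id] > 0.8 else ABSTAIN

applier = PandasLFApplier(lfs=[lf_binary_clf])
L_train = applier.apply(df=df)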

cuent commented 3 years ago

Thanks, @vkrishnamurthy11. I created a REST service for inference and wrapped it with a labeling function. I'll close the issue. Thanks for your answer :)
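In outline, that approach might look like the sketch below; the endpoint URL and JSON schema are hypothetical, since the actual service isn't shown in this thread. Because the model lives in a separate process, the LF itself stays small and picklable for parallel appliers.

import requests
from snorkel.labeling import labeling_function

ABSTAIN, POSITIVE = -1, 1

@labeling_function()
def lf_remote_clf(x):
    # Delegate inference to the service so no model object is captured
    # in the LF's closure.
    resp = requests.post(
        "http://localhost:8000/predict",  # hypothetical endpoint
        json={"text": x.payload},
        timeout=5,
    )
    score = resp.json()["score"]  # hypothetical response schema
    return POSITIVE if score > 0.8 else ABSTAIN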

kbagalo commented 2 years ago

@cuent Can you please share the code for what worked, i.e. the labeling function and how you wrapped the service with it?