snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

How to create training data for an NER task using Snorkel? #1254

Open thak123 opened 5 years ago

thak123 commented 5 years ago

I want to create a dataset using Snorkel labeling functions, but I am not able to find any links. I want to train an NER model using that data.

Can anyone tell me how to proceed?

Mageswaran1989 commented 5 years ago

@thak123 Follow the link https://github.com/HazyResearch/snorkel/issues/838; you will find the following notebooks:

  1. Crowdsourced_Sentiment_Analysis
  2. Categorical_Classes

But I am doubtful about the area of tagging table data from PDFs/receipts.

hpeiyan commented 5 years ago
Hi @Mageswaran1989, the link you posted appears to be broken (page not found).

ajratner commented 5 years ago

Hi @thak123, while you can hopefully look at some of the existing tutorials to help you in the interim, we're actually planning to release an NER-specific tutorial soon! Marking as "feature request" and will leave this open until it's done.

marctorsoc commented 5 years ago

Hi @ajratner, I'm quite interested in this feature. Do you have an expected timeline for the release of those tutorials? Not a hard deadline, just to know whether it's a matter of weeks, months, or years...

christopheratfarmjournal commented 5 years ago

Hi @ajratner, I'm very interested in this feature. Any idea when the tutorial may be released? Here we are two months after your previous mention... does it still look months away?

maciejbiesek commented 5 years ago

Any update on this issue?

pfllo commented 5 years ago

I found two papers on the Snorkel resources page that tackle the NER task. The SwellShark paper handles the overlapping-candidate problem in NER using a maximum marginal likelihood approach. The MeTaL paper uses a matrix completion-style approach, but I can't find any details on how it handles the overlapping-candidate problem in NER. @ajratner Could you give some hints on how to handle the overlapping-candidate problem in the matrix completion-style approach, so that we can try out the NER task before the tutorial comes out?

thak123 commented 4 years ago

Any update on this issue?

blah-crusader commented 4 years ago

Also interested... C'mon guys! :D

jason-fries commented 4 years ago

The simplest way to do NER/sequence labeling using the off-the-shelf Snorkel label model is to assume each token is independent and define your label matrix L as tokens × LFs. Mechanistically, you can materialize this matrix however you like, but conceptually it's cleaner to define your LFs as accepting sequences as input and returning a vector of token labels. These sequence LFs can apply regular expressions, dictionary matching, arbitrary heuristics, etc., as per typical Snorkel labeling functions.

For example, you could build a very simple NER LOCATION model (using binary/IO tagging) as follows:

import numpy as np
from scipy.sparse import dok_matrix, vstack, csr_matrix

ABSTAIN = -1
LOCATION = 1
NOT_LOCATION = 0

# helper functions
def dict_match(sentence, dictionary, max_ngrams=4):
    """Return {token_index: 1} for all tokens covered by a dictionary n-gram."""
    m = {}
    for i in range(len(sentence)):
        for j in range(i+1, min(len(sentence), i + max_ngrams) + 1):
            term = ' '.join(sentence[i:j])
            if term in dictionary:
                # sentence[i:j] covers tokens i..j-1, so mark exactly that span
                m.update({idx: 1 for idx in range(i, j)})
    return m

def create_token_L_mat(Xs, Ls, num_lfs):
    """
    Create token-level LF matrix from LFs indexed by sentence
    """
    Yws = []
    for sent_i in range(len(Xs)):
        ys = dok_matrix((len(Xs[sent_i]), num_lfs))
        for lf_i in range(num_lfs):
            for word_i,y in Ls[sent_i][lf_i].items():
                ys[word_i, lf_i] = y
        Yws.append(ys)
    return csr_matrix(vstack(Yws))

# labeling functions
def LF_is_location(s):
    locations = {"Big Apple", "Cupertino", "Cupertino, California", "California"}
    matches = dict_match(s, locations)
    return {i:LOCATION if i in matches else ABSTAIN for i in range(len(s))}

def LF_is_company(s):
    companies = {"Apple", "Apple, Inc."}
    matches = dict_match(s, companies)
    return {i:NOT_LOCATION if i in matches else ABSTAIN for i in range(len(s))}

def LF_is_titlecase(s):
    return {i:LOCATION if s[i][0].isupper() else ABSTAIN for i in range(len(s))}

# training set
sents = [
    "Apple, Inc. is headquartered in Cupertino, California .".split(),
    "Explore the very best of the Big Apple .".split(),
]

lfs = [
    LF_is_location,
    LF_is_company,
    LF_is_titlecase
]

# apply the labeling functions and build the token-level label matrix
L = [[lf(s) for lf in lfs] for s in sents] 
L = create_token_L_mat(sents, L, len(lfs))

# train your Snorkel label model 
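# For instance (an illustrative sketch, not part of the original comment;
# assumes snorkel >= 0.9's LabelModel API and the L matrix built above):
from snorkel.labeling.model import LabelModel

L_dense = np.asarray(L.todense()).astype(int)   # LabelModel expects a dense int array
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_dense, n_epochs=500, seed=123)
probs = label_model.predict_proba(L_dense)      # [n_tokens, 2] token-level probabilities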

Generating weakly labeled sequence data then just requires some bookkeeping to split your predicted token probabilities back into their original sequences.
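A minimal sketch of that bookkeeping (an illustration, assuming the probs array and sents list from the example above):

offsets = np.cumsum([len(s) for s in sents])[:-1]   # split points between sentences
probs_per_sent = np.split(probs, offsets)           # one [len(sent), 2] array per sentence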

When training the end model, you can either mask tokens that don't have any LF coverage or assume some prior (e.g., all tags are equally likely) and train a BERT, BiLSTM, etc. model.
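One way to implement that (a sketch under the same assumptions as above; the uniform prior is just one possible choice):

covered = (L_dense != ABSTAIN).any(axis=1)   # tokens with at least one non-abstain LF vote
probs[~covered] = 0.5                        # back off to a uniform prior over both classes
# alternatively, drop (mask) the uncovered tokens from the end model's training loss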

As @pfllo pointed out, there are dependencies between tokens that we would also like to capture in the label model. The papers Multi-Resolution Weak Supervision for Sequential Data and Weakly Supervised Sequence Tagging from Noisy Rules do handle this (both have code available). In practice, however, treating tokens independently and using the default Snorkel label model works surprisingly well, especially if you come from a domain with rich knowledge base and dictionary resources, such as biomedicine or geography.

ajratner commented 4 years ago

@jason-fries thanks so much! Just to make sure it's clear: this repo has been generously maintained primarily by researchers like @jason-fries, and we are in general very capacity limited in terms of major changes to the repo. As such, we currently don't have a timeline on an NER tutorial. Contributions are very welcome though!

To additionally be clear: our policy for the issues page is that questions and comments are great, but demands such as "cmon guys" are not appropriate usage. Thanks for your understanding!

ajratner commented 4 years ago

And also just to be very clear: we all really want to put more stuff out here... we're working on it, and so grateful to all of you on the issues page for your patience, enthusiasm, and support in trying Snorkel out in the meantime!!! :)

blah-crusader commented 4 years ago

Thanks a lot for this response @jason-fries! @ajratner apologies for coming across as impatient/rude; I've been really amazed by the current release and the corresponding research papers, and did not mean anything other than: "I'm also super interested in staying up to date on the topic".

Thanks!

marctorsoc commented 4 years ago

@jason-fries thanks for this. I already experimented with a similar approach in the past, but it's really useful to have confirmation that this actually works quite well, and that there's not much difference (given enough resources) compared to something specific to sequence data 👍

raj5287 commented 4 years ago

@jason-fries thanks for this, but could you please tell me how to train the MajorityLabelVoter or LabelModel? I am getting errors with both of these methods, and even with LFAnalysis(L=L_, lfs=lfs).lf_summary(). I am guessing this may be because of the sparse matrix, since the error is NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported. Could you please help me out here: what should I do next?

alvin-c-shih commented 4 years ago

@raj5287 MajorityLabelVoter requires L to be an integer type, LFAnalysis requires the matrix to be dense, and other operations prefer np.array over np.matrix.

Try this as a tactical fix:

# convert the sparse label matrix to a dense integer ndarray
L = np.asarray(L.astype(np.int8).todense())
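After that conversion, the standard calls should work, e.g. (a sketch assuming snorkel 0.9's LFAnalysis and MajorityLabelVoter APIs):

from snorkel.labeling import LFAnalysis
from snorkel.labeling.model import MajorityLabelVoter

LFAnalysis(L=L).lf_summary()                          # per-LF coverage/overlap/conflict stats
preds = MajorityLabelVoter(cardinality=2).predict(L)  # majority-vote token labels
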
rjurney commented 2 years ago

The thing to do here is to use skweak, not Snorkel. Snorkel is a commercial tool now, and investments in this area are going into other projects.