Open thak123 opened 5 years ago
@thak123 Follow the link https://github.com/HazyResearch/snorkel/issues/838. You will find the following notebooks:
- Crowdsourced_Sentiment_Analysis
- Categorical_Classes
But I am doubtful about the area of tagging table data from PDFs/receipts.
Hi Mageswaran. The link you posted is not found.
Hi @thak123 while you can hopefully look at some of the existing tutorials to help you in the interim, we're actually planning to release an NER-specific tutorial soon! Marking as "feature request" and will leave open till this is done
Hi @ajratner, I'm quite interested in this feature. Do you have an expected timeline for the release of those tutorials? Not a hard deadline, just to know whether it's a matter of weeks, months, or years...
Hi @ajratner, I'm very interested in this feature. Any idea when the tutorial may be released? Here we are 2 months after your previous mention... does it still look months away?
Any update on this issue?
I found 2 papers on the Snorkel resources page that tackle the NER task. The SwellShark paper handles the overlapping candidate problem in NER using a maximum marginal likelihood approach. The MeTaL paper uses the matrix completion-style approach, but I can't find any details on handling the overlapping candidate problem in NER. @ajratner Could you give some hints on how to handle the overlapping candidate problem in the matrix completion-style approach, so that we can try out the NER task before the tutorial comes out?
Any update on this issue?
Also interested.. C'mon guys! :D
The simplest way to do NER/sequence labeling using the off-the-shelf Snorkel label model is to assume each token is independent and define your label matrix L as tokens x LFs. Mechanistically, you can materialize this matrix however you like, but conceptually it's cleaner to define your LFs as accepting a sequence as input and returning a vector of token labels. These sequence LFs can apply regular expressions, dictionary matching, arbitrary heuristics, etc., as per typical Snorkel labeling functions.
For example, you could build a very simple NER LOCATION model (using binary/IO tagging) as follows:
import numpy as np
from scipy.sparse import dok_matrix, vstack, csr_matrix

ABSTAIN = -1
LOCATION = 1
NOT_LOCATION = 0

# helper functions
def dict_match(sentence, dictionary, max_ngrams=4):
    m = {}
    for i in range(len(sentence)):
        for j in range(i + 1, min(len(sentence), i + max_ngrams) + 1):
            term = ' '.join(sentence[i:j])
            if term in dictionary:
                # tokens i..j-1 make up the matched n-gram
                m.update({idx: 1 for idx in range(i, j)})
    return m

def create_token_L_mat(Xs, Ls, num_lfs):
    """
    Create token-level LF matrix from LFs indexed by sentence
    """
    Yws = []
    for sent_i in range(len(Xs)):
        ys = dok_matrix((len(Xs[sent_i]), num_lfs))
        for lf_i in range(num_lfs):
            for word_i, y in Ls[sent_i][lf_i].items():
                ys[word_i, lf_i] = y
        Yws.append(ys)
    return csr_matrix(vstack(Yws))

# labeling functions
def LF_is_location(s):
    locations = {"Big Apple", "Cupertino", "Cupertino, California", "California"}
    matches = dict_match(s, locations)
    return {i: LOCATION if i in matches else ABSTAIN for i in range(len(s))}

def LF_is_company(s):
    companies = {"Apple", "Apple, Inc."}
    matches = dict_match(s, companies)
    return {i: NOT_LOCATION if i in matches else ABSTAIN for i in range(len(s))}

def LF_is_titlecase(s):
    return {i: LOCATION if s[i][0].isupper() else ABSTAIN for i in range(len(s))}

# training set
sents = [
    "Apple, Inc. is headquartered in Cupertino, California .".split(),
    "Explore the very best of the Big Apple .".split(),
]

lfs = [
    LF_is_location,
    LF_is_company,
    LF_is_titlecase
]

# apply labeling functions and transform label matrix
L = [[lf(s) for lf in lfs] for s in sents]
L = create_token_L_mat(sents, L, len(lfs))

# train your Snorkel label model
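A minimal sketch of that last step, assuming the snorkel 0.9.x API (the import path, hyperparameters, and the dense conversion are assumptions, not part of the original snippet):

import numpy as np
from snorkel.labeling.model import LabelModel

# the label model expects a dense integer matrix, so convert the sparse L first
L_dense = np.asarray(L.astype(np.int8).todense())

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_dense, n_epochs=500, seed=123)

# one row of class probabilities per token: columns [NOT_LOCATION, LOCATION]
token_probs = label_model.predict_proba(L_dense)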
Generating weakly labeled sequence data then just requires some bookkeeping to split your predicted token probabilities back into their original sequences.
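For instance, a rough sketch of that bookkeeping, assuming token_probs holds the label model's per-token probabilities in the same order the sentences were stacked:

import numpy as np

sent_lens = [len(s) for s in sents]
split_points = np.cumsum(sent_lens)[:-1]

# one (sentence_length x num_classes) array of probabilities per original sentence
probs_by_sent = np.split(token_probs, split_points)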
When training the end model, you can either mask tokens that don't have any LF coverage or assume some prior (e.g., all tags are equally likely) and train a BERT, BiLSTM, etc. model.
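A hedged illustration of those two options (names such as L_dense, token_probs, and train_mask are illustrative, not Snorkel API):

import numpy as np

# tokens covered by at least one non-abstaining LF
covered = (L_dense != ABSTAIN).any(axis=1)

# option 1: mask uncovered tokens so they are ignored in the end model's loss
train_mask = covered

# option 2: fall back to a uniform prior over tags for uncovered tokens
soft_labels = token_probs.copy()
soft_labels[~covered] = 1.0 / soft_labels.shape[1]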
As @pfllo pointed out, there are dependencies between tokens that we would also like to capture in the label model. The papers Multi-Resolution Weak Supervision for Sequential Data and Weakly Supervised Sequence Tagging from Noisy Rules do handle this (both have code available). In practice, however, treating tokens independently and using the default Snorkel label model works surprisingly well, especially if you come from a domain with rich knowledge base & dictionary resources such as biomedicine, geography, etc.
@jason-fries thanks so much! Just to make sure it's clear: this repo has been generously maintained primarily by researchers like @jason-fries, and we are in general very capacity limited in terms of major changes to the repo. As such, we currently don't have a timeline on an NER tutorial. Contributions are very welcome though!
To additionally be clear: our policy for the issues page is that questions and comments are great, but demands such as "cmon guys" are not appropriate usage. Thanks for your understanding!
And also just to be very clear: we all really want to put more stuff out here... we're working on it, and so grateful to all of you on the issues page for your patience, enthusiasm, and support in trying Snorkel out in the meantime!!! :)
Thanks a lot for this response @jason-fries ! @ajratner apologies for coming across as impatient/rude; I've been really amazed by the current release and the corresponding research papers, and did not mean anything other than: "I'm also super interested in staying up to date on the topic".
Thanks!
Thanks for this, @jason-fries! I already experimented with a similar approach in the past, but it's really useful to have confirmation that this actually works quite well and that there's not much difference (given enough resources) compared to something specific to sequence data 👍
@jason-fries thanks for this, but could you please tell me how to train the MajorityLabelVoter or LabelModel? I am getting an error with both of these methods, and even with LFAnalysis(L=L_, lfs=lfs).lf_summary(). I am guessing this is because of the sparse matrix, since the error is NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported. Could you please help me out here, what should I do next?
@raj5287 MajorityLabelVoter requires L to be of integer type. LFAnalysis requires the matrix to be dense. Other operations prefer np.array instead of np.matrix.
Try this as a tactical fix:
L = np.asarray(L.astype(np.int8).todense())
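As a quick check (a sketch assuming snorkel 0.9.x; import paths and arguments are assumptions), the standard utilities should then run on the dense integer matrix produced by that fix:

from snorkel.labeling import LFAnalysis
from snorkel.labeling.model import LabelModel, MajorityLabelVoter

# passing only L avoids needing LabelingFunction objects for the lfs argument
print(LFAnalysis(L=L).lf_summary())

mv = MajorityLabelVoter(cardinality=2)
preds = mv.predict(L)  # majority-vote label per token

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L, n_epochs=500, seed=123)
probs = label_model.predict_proba(L)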
The thing to do here is to use skweak, not Snorkel. Snorkel is a commercial tool now, and investment in this area is going into other projects.
I want to create a dataset using Snorkel labeling functions, but I am not able to find any links. I want to train an NER model using the above data.
Can anyone tell me how to proceed?