snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Training data for training a NER model #599

Closed arjasethan1 closed 7 years ago

arjasethan1 commented 7 years ago

Hi,

I have two questions about the ways of using the snorkel:

1) I am trying to train a Stanford CoreNLP NER model to pick out all the software names and vendor names from the comments and descriptions of software (raw text). I tried doing some manual labeling, which is very time-consuming, but I can clearly see the accuracy improve as I keep increasing the training data. So I am trying to use Snorkel to produce training data for me, but it seems (from the tutorials) that Snorkel already uses CoreNLP NER models and generates training data for extracting relations between two entities. Is there a way to use Snorkel to create training data for extracting entities rather than relations between them?

2) I have also used DeepDive to extract relations between entities, and when I skim through the tutorials I am not able to find much difference between DeepDive and Snorkel. Is Snorkel the Python version of DeepDive?

ajratner commented 7 years ago

Hi @arjasethan1,

Great questions! Responding in order:

(1) Yes, Snorkel can definitely be used for this! (in fact, we have a paper on entity tagging which will be posted very soon...). At a high level, the goal of a Snorkel application is to train a classifier (the "end discriminative model") to classify possible or candidate extractions. In the intro tutorial, these candidates are potential mentions of spouse relations, but they could also just be mentions of single entities. For example, if you were just trying to tag mentions of people in the intro tutorial (part 2), instead of pairs of people that are spouses, you would instead just do:

from snorkel.models import candidate_subclass
Person = candidate_subclass('Person', ['person1'])

Basically everything else would be the same for the rest of the process, other than that your candidate class would now be different, so you would write slightly different types of labeling functions, etc.

And yes, we do run CoreNLP's NER tagger during preprocessing, which tags a basic set of entity types (e.g. PERSON, ORG, etc.), but you can use Snorkel to tag more specific / less standard entity types where hand-labeled training data is not readily available, such as in your scenario!

(2) The main differences between Snorkel and DeepDive are (1) that Snorkel uses data programming (https://papers.nips.cc/paper/6523-data-programming-creating-large-training-sets-quickly) to model noisy training data, adding in a new modeling stage for this (the generative model in intro tutorial 4), (2) Snorkel is written in Python and supports a simple Jupyter notebook interface, and (3) Snorkel focuses on independent classification problems rather than more complex factor graphs (at the current moment at least!)
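To make the data-programming idea concrete, here is a minimal pure-Python sketch (an illustration only, not Snorkel's API): several noisy labeling functions vote +1 / -1 / abstain (0) on a candidate span, and their votes are combined. Snorkel's generative model learns per-LF accuracies rather than taking this naive majority vote.

```python
import re

# Hypothetical labeling functions for tagging vendor-name candidates.
def lf_company_suffix(span):
    """Vote +1 if the span carries a company suffix, else abstain."""
    return 1 if re.search(r"\b(Inc|Corp|LLC)\b", span) else 0

def lf_all_caps(span):
    """Vote +1 for all-caps spans (often acronyms / product names)."""
    return 1 if span.isupper() else 0

def lf_lowercase(span):
    """Vote -1 (not an entity) if the span is entirely lowercase."""
    return -1 if span.islower() else 0

LFS = [lf_company_suffix, lf_all_caps, lf_lowercase]

def majority_label(span):
    """Combine LF votes by simple majority; 0 means no decision."""
    votes = sum(lf(span) for lf in LFS)
    return 1 if votes > 0 else (-1 if votes < 0 else 0)

majority_label("Acme Inc")  # -> 1 (suffix LF fires, others abstain)
```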

Hope this answers your questions!

arjasethan1 commented 7 years ago

Hi @ajratner, Thanks to you and your team for building this great tool, making it public, and inspiring a lot of young minds like mine.

That pretty much clears up my doubts, and it seems like Snorkel can solve my problem. But I have a few more questions:

Do I still need to run CoreNLP (in this scenario) and extract sentences, parts of speech, NER entities, etc., which might be good features for building the generative model, or can Snorkel extract its own features?

And I want to train an NER model to pick out two classes (maybe more), "Software" and "Vendor". Can I just define the two classes this way?

from snorkel.models import candidate_subclass
Software = candidate_subclass('Software', ['software'])
Vendor = candidate_subclass('Vendor', ['vendor'])

Thanks!!

ajratner commented 7 years ago

Hi @arjasethan1 that's great to hear! We'd love to hear how your project goes (both positive and negative), it's great feedback for us in our research!

You will still need to run at least some elements of CoreNLP (e.g. splitting the sentence into words, getting grammatical structure, etc.). We actually have an upcoming PR that will allow more customizability here in terms of what is run, but I would probably leave the defaults to start.

And yes, right now Snorkel is geared for binary extraction, so each class would be its own Snorkel classification model. You can run them in the same notebook or in two separate notebooks; either way should be fine!

thammegowda commented 7 years ago

Hi @arjasethan1 and @ajratner

I am working in the same area. Here is my approach (please correct me or suggest me anything I am missing)

I used POS tags, taking proper nouns (NNP) as candidates (but later found that the POS tagger has many false negatives for proper nouns in my domain, so I included all NN* tags). @ajratner Thanks for the regex matcher, it came in handy :-)

I used RegexMatchEach(attrib='pos_tags', rgx="NN.*") instead of PersonMatcher to generate the candidates.

Then I wrote a set of labeling functions based on rules and distant supervision.
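For readers new to the matcher, RegexMatchEach's behavior can be sketched in plain Python: a span of tokens is accepted only if every token's chosen attribute matches the regex. The function and token dicts below are illustrative stand-ins, not Snorkel's implementation:

```python
import re

def matches_each(tokens, attrib, rgx):
    """Accept a token span only if every token's attribute matches the
    regex, mirroring the behavior of Snorkel's RegexMatchEach matcher."""
    pattern = re.compile(rgx)
    return all(pattern.fullmatch(tok[attrib]) for tok in tokens)

# Tokens as {attribute: value} dicts, as a CoreNLP-style preprocessor
# might produce them (hypothetical shape).
span = [{'word': 'Apache', 'pos_tags': 'NNP'},
        {'word': 'Spark', 'pos_tags': 'NNP'}]
matches_each(span, 'pos_tags', 'NN.*')  # -> True, every tag matches NN.*
```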

--

We actually have an upcoming PR that will allow more customizability here in terms of what is run, but I would probably leave the defaults to start.

These features would be very useful:

  1. Option to specify which annotators to use in the CoreNLP pipeline (so that we can exclude costly operations such as NER when we don't need them).
  2. Option to set other CoreNLP properties, such as ner.model, etc. (so that we can specify our custom NER models trained with the CRF classifier).
ajratner commented 7 years ago

@thammegowda sounds like a great start! And glad the regex matcher came in handy :)

Let's check back on this once the PR for this is in; we can always add more functionality!

arjasethan1 commented 7 years ago

Hi @ajratner, Thanks for your reply, that makes a lot of sense. I will definitely let you guys know how my work went once it's done.

Thanks @thammegowda for your suggestions. I had thought of the same approach: most of the entities I want to pick are also nouns, so they should be good candidates to start with. Thanks for letting me know about RegexMatchEach and the CoreNLP options before I wasted time on it.

arjasethan1 commented 7 years ago

Hi, I labeled all the candidates from the development and test datasets using the viewer, but I couldn't figure out how to load that hand-labeled data into a sparse matrix for further evaluation and training, like this line in the tutorial: L_gold_dev = load_gold_labels(session, annotator_name='gold', split=1)

I tried l_test = load_label_matrix(session=session, split=2), but it gave me <222x0 sparse matrix of type '<type 'numpy.float64'>' with 0 stored elements in Compressed Sparse Row format>

Thanks, Sethan.

arjasethan1 commented 7 years ago

I found it!! I misunderstood the gold labels a bit: when I changed annotator_name in L_gold_dev = load_gold_labels(session, annotator_name='gold', split=1) to my user name, I was able to load the labeled data.

Thanks.

thammegowda commented 7 years ago

Yes! For future visitors, to load the labels annotated using SentenceNgramViewer

import os

from snorkel.annotations import load_gold_labels
L_gold_dev = load_gold_labels(session, annotator_name=os.environ['USER'], split=1)
ajratner commented 7 years ago

We have a note somewhere to make this clearer in the tutorial, will add soon!

arjasethan1 commented 7 years ago

Hi,

I finished writing some labeling functions for my use case and they seem to work well. I want to get this data (e.g. in the form of sentences with labels saying which word is a software and which is not, like the data I see in the viewer) to train my NER model. How can I do that? I tried querying the database to see if that information is saved in any of the tables, but I couldn't figure it out. Any help would be appreciated.

Thanks!!

fsonntag commented 7 years ago

@arjasethan1 Maybe this code snippet is helpful; I have extended the export code by @jason-fries for the brat tool. It writes the text and the annotations to separate files.

    def export_project(self, output_dir, positive_only_labels=True):
        """
        :param output_dir: directory to write the brat project files into
        :param positive_only_labels: if True, skip candidates whose training
            marginal is at most 0.5
        :return: None
        """
        os.makedirs(output_dir, exist_ok=True)
        candidates = self.session.query(Candidate).filter(Candidate.split == 0).all()
        doc_index = _group_by_document(candidates)
        snorkel_types = {type(c) for c in candidates}
        configuration_string = self._create_config_from_candidate_types(snorkel_types)
        # write the annotation file
        with open(os.path.join(output_dir, 'annotation.conf'), 'w') as conf_file:
            conf_file.write(configuration_string)

        # iterate over the documents
        for name in doc_index:
            # write the text
            with open(os.path.join(output_dir, f'{name}.txt'), 'w') as text_file:
                text = "".join([sentence.text for sentence in doc_index[name][0][0].sentence.document.sentences])
                text_file.write(text)

            # write the annotation file
            with open(os.path.join(output_dir, f'{name}.ann'), 'w') as ann_file:
                annotation_tuples = []
                for i, c in enumerate(doc_index[name]):
                    if positive_only_labels and c.training_marginal <= 0.5:
                        continue
                    sentence_start = sum(len(sentence.text) for sentence in c[0].sentence.document.sentences[:c[0].sentence.position])
                    char_start = sentence_start + c[0].char_start
                    char_end = sentence_start + c[0].char_end + 1
                    text = c[0].get_span()
                    annotation_tuples.append((c.__class__.__name__, char_start, char_end, text))

                annotation_tuples.sort(key=lambda tuple: tuple[1])
                lines = [f'T{i + 1}\t{annotation_tuple[0]} {annotation_tuple[1]} {annotation_tuple[2]}\t{annotation_tuple[3]}\n' for i, annotation_tuple in enumerate(annotation_tuples)]
                ann_file.writelines(lines)

So a candidate has an id that will be the same as in your software table. There you can find a ..._id field; this is the id of the corresponding span. The span table then tells you the sentence and the position within that sentence.
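The offset arithmetic in the snippet (document-level start = total length of all preceding sentences + the span's sentence-relative start, with an inclusive char_end) can be checked in isolation. A small self-contained sketch with made-up sentence texts:

```python
def doc_char_offsets(sentences, sent_idx, char_start, char_end):
    """Convert sentence-relative character offsets (char_end inclusive,
    as in the export snippet) to document-level slice offsets, assuming
    the document text is the plain concatenation of its sentences."""
    sentence_start = sum(len(s) for s in sentences[:sent_idx])
    return sentence_start + char_start, sentence_start + char_end + 1

sentences = ["Acme ships FooDB. ", "FooDB is made by Acme."]
# "FooDB" occupies chars 0..4 (inclusive) of the second sentence.
start, end = doc_char_offsets(sentences, 1, 0, 4)
"".join(sentences)[start:end]  # -> 'FooDB'
```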

fsonntag commented 7 years ago

Anyway, I'm struggling with overlapping candidates; in the SwellShark paper they use a multinomial model with overlapping spansets. So far I can create those spansets, but I don't know how to feed them to the generative model. Previously we had L_train, which was of size #spans x #LFs, and the model learned the weights. Now I have a bunch of spansets of size #LFs x #spans each, and I don't really know what to do with them...

Edit: I made a workaround by taking the span with the highest sum of labels in each spanset (if there is more than one, using all of them) and setting the labels of the others to 0. Works well and fast.
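The workaround described in the edit can be sketched in plain Python (a spanset as a list of per-LF vote rows, #LFs x #spans; the function name and data shapes are illustrative, not Snorkel code):

```python
def collapse_spanset(spanset):
    """Within one spanset (#LFs x #spans, votes in {-1, 0, 1}), keep the
    span(s) with the highest total vote and zero out all other columns."""
    n_spans = len(spanset[0])
    scores = [sum(lf_row[j] for lf_row in spanset) for j in range(n_spans)]
    best = max(scores)  # ties keep every top-scoring span
    return [[vote if scores[j] == best else 0
             for j, vote in enumerate(lf_row)] for lf_row in spanset]

# Two LFs over three overlapping spans: span 0 wins with total vote 2.
collapse_spanset([[1, 0, -1],
                  [1, 1, 0]])  # -> [[1, 0, 0], [1, 0, 0]]
```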

jason-fries commented 7 years ago

Hi @fsonntag, the current implementation of the generative model in Snorkel doesn't have multinomial support, but that should be added soon (see Issue 604). In the meantime, implementing some heuristic as you indicated can work pretty well. Another option is using the simple Naive Bayes generative model we used in the SwellShark paper https://github.com/HazyResearch/snorkel/blob/newftrs/snorkel/learning_mn.py

fsonntag commented 7 years ago

Thanks a lot for the answer, @jason-fries. I thought about using Naive Bayes; nevertheless, I'm not sure what to use as input. The spansets all have different sizes (depending on the number of spans) and I don't know how to put them together...

jason-fries commented 7 years ago

@fsonntag the linked multinomial Naive Bayes code is trained over the list of all spanset matrices X_hat, which can be of different sizes. If you use MnLogReg, you can pass in your list of spanset matrices and learn LF weights ('accuracies') as before. Marginals then consist of a multinomial distribution across all candidates within a given spanset.

fsonntag commented 7 years ago

Works without a flaw, thanks for the feedback :)

arjasethan1 commented 7 years ago

HI @fsonntag ,

Thanks for your reply!! How can I use this function export_project? Do I need to define it inside a class, since it takes self as its first argument, or is it already defined on some class?

fsonntag commented 7 years ago

Check out these: https://github.com/HazyResearch/snorkel/blob/brat/snorkel/contrib/brat/tools.py https://github.com/HazyResearch/snorkel/blob/a1fc55e26c4a1d3f9660b95befd33c92bfded159/tutorials/cdr/CDR_Tutorial_BRAT_Export.ipynb Jason implemented another version, and in the iPython notebook he shows how to use it. Maybe that's simpler :)

arjasethan1 commented 7 years ago

Thank you very much @fsonntag for pointing me to this.

arturomp commented 6 years ago

Related to @arjasethan1's first question and the corresponding answer by @ajratner, I'm wondering what the appropriate function is in order to do the RNN+LSTM training towards the end for mentions of single entities/events/etc. I'm running a slightly modified version of the intro tutorial.

I'm mostly concerned that the context obtained within the for loop is a single word and that that could cause some sort of unexpected behaviour (for good or for bad). I looked at issue #838 but this isn't addressed there either.

I'm using a single-event candidate:

from snorkel.models import candidate_subclass
typing = candidate_subclass('typing', ['action'])

and in the "Training an End Extraction Model" section, I'm currently using a slightly modified TextRNN() (below) because reRNN() requires two arguments and would (obviously) fail with an IndexError. The minor modification involved extracting the text from a Span object, since it doesn't have a text attribute. I'm still not sure this is the right approach.

import numpy as np

from snorkel.models import Span
# RNNBase and SymbolTable come from Snorkel's learning package,
# imported as in Snorkel's own TextRNN implementation.

class TextRNN(RNNBase):
    """TextRNN for strings of text."""
    def _preprocess_data(self, candidates, extend=False):
        """Convert candidate sentences to lookup sequences

        :param candidates: candidates to process
        :param extend: extend symbol table for tokens (train), or lookup (test)?
        """
        if not hasattr(self, 'word_dict'):
            self.word_dict = SymbolTable()
        data, ends = [], []
        for candidate in candidates:
            if type(candidate.get_contexts()[0]) == Span:
                toks = candidate.get_contexts()[0].get_span().split()
            else:
                toks = candidate.get_contexts()[0].text.split()
            # Either extend word table or retrieve from it
            f = self.word_dict.get if extend else self.word_dict.lookup
            data.append(np.array(list(map(f, toks))))
            ends.append(len(toks))
        return data, ends
fsonntag commented 6 years ago

@arturomp They also have a tagging RNN implemented for that purpose, but I think it's slightly outdated and you also have to modify the code a little to get it running. In the pca branch, there is a Word-Char LSTM, which works fine for entity tagging. I customized it to an LSTM model that reads the left and right context and the characters of the candidate, it worked best in my use case. You can check it out here if you're interested.

pidugusundeep commented 6 years ago

@thammegowda According to this -> https://github.com/HazyResearch/snorkel/issues/599#issuecomment-296262029, I understood that we can use a regex for nouns, but it internally matches the regex against individual tokens. I want to match combinations of co-occurring POS tags like NN,NN or NNP,NNP or NNP,NN, and this is the regex I wrote: ((NN\sNN)|(NNP\sNNP)|(NNP\sNN)).* But it doesn't identify a single candidate in the documents. Is my regex incorrect, or do I have to modify the existing matcher to accept a sentence instead of tokens? I am working with domain-specific keywords that I want to identify, so I can't rely on NER tagging (CoreNLP) either. Can you please help me?

thammegowda commented 6 years ago

@pidugusundeep I think the regex matcher matches one token at a time (I could be wrong), so it won't be able to match two tokens, since \s never occurs within a POS tag. If you need context (previous and next tokens), then labeling functions should be able to help you. Suggestion: break your task into two phases:

  1. In the first phase, match all the nouns.
  2. Then let your labeling functions use context to filter for co-occurring nouns.

I hope that helps
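The two-phase suggestion can be sketched in plain Python: phase one collects every token whose POS tag matches NN.*, and phase two pairs up adjacent nouns. All names below are illustrative, not Snorkel API:

```python
import re

def noun_indices(tagged_tokens):
    """Phase 1: indices of tokens whose POS tag matches NN.*"""
    return [i for i, (_, pos) in enumerate(tagged_tokens)
            if re.fullmatch(r"NN.*", pos)]

def adjacent_noun_pairs(tagged_tokens):
    """Phase 2: merge co-occurring nouns (e.g. NNP,NNP sequences) by
    checking whether consecutive token indices are both nouns."""
    nouns = set(noun_indices(tagged_tokens))
    return [(tagged_tokens[i][0], tagged_tokens[i + 1][0])
            for i in sorted(nouns) if i + 1 in nouns]

tokens = [("Apache", "NNP"), ("Spark", "NNP"), ("runs", "VBZ"),
          ("on", "IN"), ("clusters", "NNS")]
adjacent_noun_pairs(tokens)  # -> [('Apache', 'Spark')]
```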

Anirudh-Muthukumar commented 6 years ago

@thammegowda @arjasethan1 Can you brief me on how to load gold-labeled data? I have hand-labeled data for a few observations in a file. How am I supposed to use it to perform analysis on the model?

Thanks.

thammegowda commented 6 years ago

@Anirudh-Muthukumar did you try the code snippet I posted here https://github.com/HazyResearch/snorkel/issues/599#issuecomment-297544773

Anirudh-Muthukumar commented 6 years ago

Yes @thammegowda I tried the same snippet here.

from snorkel.annotations import load_gold_labels
L_gold_dev = load_gold_labels(session, annotator_name=os.environ['USER'], split=1)

This is how my script looks.

Output: <2632x0 sparse matrix of type '<type 'numpy.float64'>' with 0 stored elements in Compressed Sparse Row format>

Are there any changes to be made?