snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Annotate multiple entity/relation in text dataset [example / guideline] #1078

Closed msank00 closed 5 years ago

msank00 commented 5 years ago

Hi Team,

I really appreciate this great project. I would like to know what kinds of techniques and labeling functions I should look at to annotate a text dataset using Snorkel. The final extraction model should be able to annotate specific information such as dates, reference numbers, person names, organization names, etc. In the Intro notebook, only one relation, Spouse, is extracted. Can we do multiple relation/entity extraction from the same document?

Any relevant research papers or pointers to articles would be really helpful. If anyone in the community has already tried this and would like to shed some light on the overall approach, that would be very helpful.

Presently I am running the example notebooks to understand the workflow. I have the following questions related to the examples:

Q1. What is the purpose of the following piece of code: Spouse.split == 2? I found this in the 2nd Intro notebook. What does split do?

Q2. Can multiple candidate_subclass types be used for the same document?

vumaasha commented 5 years ago

Split is used to create the training and test splits. Split zero is used for training and split one for testing. For extracting multiple relations, you probably need to look into Snorkel MeTaL, the multi-task version of Snorkel.

bhancock8 commented 5 years ago

Glad you're enjoying the project, @msank00! We keep a collection of links to relevant papers on our landing page for the Snorkel project: snorkel.stanford.edu. There you can find examples of extensions to Snorkel, follow-on projects, a new formulation of the generative model, and support for multi-task learning/supervision in the Snorkel MeTaL project (https://github.com/HazyResearch/metal), etc. As for your other questions:

Q1: @vumaasha is correct: the split numbers are used to designate different sets (e.g., 0=train, 1=dev, 2=test).

Q2: Yes, the database behind the SQLAlchemy layer can support multiple candidate_subclass types. We don't have any examples of doing that here in the main Snorkel repo. If you're interested in just having multiple pipelines, then you'll want to create a separate LabelModel for each individual task (and its corresponding LFs). If you believe the tasks have the potential to share information in a productive way, then, as @vumaasha suggested, take a look at the Snorkel MeTaL extension, where we include support for labeling functions that weakly supervise multiple related tasks and multi-task end models.
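As a rough illustration of the two answers above, here is a minimal plain-Python sketch. It does not use Snorkel's API: the Candidate class, the labeling functions, the example sentences, and the majority_vote helper (a simple stand-in for a per-task LabelModel) are all hypothetical names chosen for this example. It shows filtering candidates by split number and keeping one independent set of LFs and one aggregator per task.

```python
import re
from collections import Counter
from dataclasses import dataclass

ABSTAIN = -1
# Split convention from the thread: 0 = train, 1 = dev, 2 = test.
TRAIN, DEV, TEST = 0, 1, 2


@dataclass
class Candidate:
    """Hypothetical candidate: a piece of text plus its split number."""
    text: str
    split: int


def by_split(candidates, split):
    """Keep only candidates in the given split (the idea behind split == 2)."""
    return [c for c in candidates if c.split == split]


def majority_vote(votes):
    """Stand-in for a label model: most common non-abstain vote, else ABSTAIN."""
    counted = Counter(v for v in votes if v != ABSTAIN)
    return counted.most_common(1)[0][0] if counted else ABSTAIN


# Task 1: does the text mention a date? (1 = yes, 0 = no)
def lf_iso_date(c):
    return 1 if re.search(r"\b\d{4}-\d{2}-\d{2}\b", c.text) else ABSTAIN


def lf_no_digits(c):
    return 0 if not any(ch.isdigit() for ch in c.text) else ABSTAIN


# Task 2: does the text mention an organization? (1 = yes, 0 = no)
def lf_org_suffix(c):
    return 1 if re.search(r"\b(Inc|Ltd|LLC)\b", c.text) else ABSTAIN


def lf_all_lowercase(c):
    return 0 if c.text.islower() else ABSTAIN


# One independent pipeline per task: its own LFs, aggregated separately.
TASK_LFS = {
    "date": [lf_iso_date, lf_no_digits],
    "organization": [lf_org_suffix, lf_all_lowercase],
}


def label_task(task, candidates):
    """Apply one task's LFs to each candidate and aggregate the votes."""
    return [majority_vote([lf(c) for lf in TASK_LFS[task]]) for c in candidates]


candidates = [
    Candidate("Invoice issued on 2019-03-14 by Acme Inc", TRAIN),
    Candidate("please call me back later", TRAIN),
    Candidate("Signed by Globex Ltd", TEST),
]

train = by_split(candidates, TRAIN)
for task in TASK_LFS:
    print(task, label_task(task, train))
```

In a real Snorkel setup the majority vote would be replaced by a trained label model per task; the point here is only that each task keeps its own LFs and its own aggregation step, and that the split number is just a filter on the candidate set.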

msank00 commented 5 years ago

Thanks for your reply, @bhancock8. I looked into Snorkel MeTaL, but I couldn't find any example related to text annotation there.

Could you please let me know if there is any Snorkel MeTaL example or tutorial available for text datasets? I have also raised an issue about this in the Snorkel MeTaL repo.

Thanks in advance.

vincentschen commented 5 years ago

An NER tutorial is generally on our roadmap—feel free to propose a general dataset or push a PR if you'd like to take a stab at this!

bhancock8 commented 5 years ago

Yep, just to echo @vincentschen, we don't currently have an entity tagging tutorial (here or in MeTaL), but there wouldn't be anything fundamentally different about that setting. The "candidates" that you're classifying would now be individual tokens or spans, and your labeling functions would weakly label those.
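To make the token-level framing concrete, here is a small plain-Python sketch, not Snorkel code. The label constants, the three labeling functions, the whitespace tokenizer, and the example sentence are assumptions made up for illustration. Each token is treated as a candidate, each labeling function votes on it or abstains, and the result is a tokens-by-LFs label matrix of the kind a label model would then consume.

```python
import re

# Hypothetical label scheme for this sketch.
ABSTAIN, OTHER, DATE, ORG = -1, 0, 1, 2
ORG_SUFFIXES = {"Inc", "Ltd", "LLC"}


def lf_iso_date(tokens, i):
    """Vote DATE for tokens shaped like YYYY-MM-DD."""
    return DATE if re.fullmatch(r"\d{4}-\d{2}-\d{2}", tokens[i]) else ABSTAIN


def lf_org_suffix(tokens, i):
    """Vote ORG for a company suffix and for the token right before one."""
    if tokens[i] in ORG_SUFFIXES:
        return ORG
    if i + 1 < len(tokens) and tokens[i + 1] in ORG_SUFFIXES:
        return ORG
    return ABSTAIN


def lf_lowercase_word(tokens, i):
    """Vote OTHER for all-lowercase words, which are rarely entity names."""
    return OTHER if tokens[i].islower() else ABSTAIN


LFS = [lf_iso_date, lf_org_suffix, lf_lowercase_word]


def label_matrix(tokens):
    """One row per token (the candidate), one column per labeling function."""
    return [[lf(tokens, i) for lf in LFS] for i in range(len(tokens))]


# Whitespace tokenization keeps the sketch dependency-free.
tokens = "Invoice from Acme Inc dated 2019-03-14".split()
matrix = label_matrix(tokens)
for token, row in zip(tokens, matrix):
    print(f"{token:12s} {row}")
```

Passing the token list and an index to each labeling function lets it look at neighboring tokens, which is usually what makes weak rules for entity tagging useful.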

vincentschen commented 5 years ago

@msank00 @vumaasha — looping back around here, wanted to plug our new community forum, where there's an ongoing discussion about a potential NER tutorial! https://spectrum.chat/snorkel/tutorials/snorkel-tutorial-for-ner~34d56436-41e6-4216-a05b-59cf391a7fb3

Feel free to leave any thoughts/suggestions!

github-actions[bot] commented 5 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.