snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Function Zoo for some common applications. #1493

Closed mrbeann closed 5 years ago

mrbeann commented 5 years ago

Labeling functions (LFs) are a fantastic idea, and they can be shared across users, so I think a Function Zoo (similar to a Model Zoo) would be very useful.

eggie5 commented 5 years ago

That's a really good idea!

Here's the first contribution:

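import re

from snorkel.labeling import labeling_function

ABSTAIN = -1  # Snorkel's convention for "no label"
ADDRESS = 1  # placeholder value; use whatever integer your task assigns to this class


def address():
    # A 1-4 digit house number, up to 20 word/space characters, then a street suffix.
    exp = r"\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)"
    regex = re.compile(exp, re.IGNORECASE)

    @labeling_function()
    def _address(x):
        # re.IGNORECASE already handles case, so no .lower() is needed.
        if regex.search(x["query"]):
            return ADDRESS
        return ABSTAIN

    return _address

dataframing commented 5 years ago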

I think if you'd like to contribute cross-domain utilities, maybe make a PR and add it under snorkel/contrib? Check out the README here.

Edit: @eggie5 great addition! I wonder how it compares with a custom-built library for address parsing like usaddress. Have you tried that out? Would be curious to see if it helps :)
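
One way to make that kind of comparison concrete is to score each candidate detector against the same hand-labeled queries. The sketch below is only an illustration: it uses the standard library alone, the six queries and their labels are made up for the example, and `regex_detector` and `precision_recall` are hypothetical helper names. A usaddress-based detector is not included here; it could be passed to `precision_recall` in the same way.

```python
import re

# Same pattern as the address LF posted above.
ADDRESS_PATTERN = re.compile(
    r"\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq"
    r"|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)"
    r"\W?(?=\s|$)",
    re.IGNORECASE,
)


def regex_detector(query: str) -> bool:
    """Return True when the regex finds something that looks like an address."""
    return ADDRESS_PATTERN.search(query) is not None


# Toy, hand-labeled queries (illustrative only): (text, is_address).
LABELED = [
    ("123 Main Street", True),
    ("directions to 1600 Pennsylvania Ave", True),
    ("best pizza near me", False),
    ("3 day forecast", False),
    ("500 days of summer", False),
    ("corner of 5th and Broadway", True),
]


def precision_recall(detector, labeled):
    """Score a boolean detector against (text, gold_label) pairs."""
    tp = fp = fn = 0
    for query, is_address in labeled:
        predicted = detector(query)
        if predicted and is_address:
            tp += 1
        elif predicted and not is_address:
            fp += 1
        elif not predicted and is_address:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    p, r = precision_recall(regex_detector, LABELED)
    print(f"regex: precision={p:.2f} recall={r:.2f}")
```

On this toy set the regex fires on "3 day forecast" (the trailing "st" is read as a street suffix) and misses "corner of 5th and Broadway", which is the sort of gap a dedicated parser may or may not close.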

bhancock8 commented 5 years ago

Thanks for raising this! An LF Zoo is something we’ve been excited about for a while as well. It's on our roadmap, but we want to make sure it's done with proper organization, testing, etc. so that it remains clean, maintainable, and general. We would most likely host it in another repo, leaving the snorkel repo focused on the core functionality. We’ll be sure to post to the Snorkel mailing list once we have anything to announce!

mrbeann commented 5 years ago

Yeah, this zoo is harder to build and needs careful design. Here are a few things I can come up with for now; I hope they help.

  1. Common tasks could be organized hierarchically.
  2. A new repo with samples and contribution guidelines.
  3. Some automated quality tests could be provided to check new PRs.
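
As a rough illustration of the third point, an automated check for a contributed LF might run it over a few sample rows and report anything suspicious. This is a sketch under stated assumptions, not an existing Snorkel utility: `check_labeling_function`, `SAMPLE_ROWS`, the two example LFs, and the 5% coverage threshold are all hypothetical, and an LF is treated as any callable that takes a row and returns an integer label.

```python
ABSTAIN = -1  # Snorkel's convention for "no vote"
ADDRESS = 1   # assumed label id, for illustration only


def check_labeling_function(lf, examples, valid_labels, min_coverage=0.05):
    """Run `lf` over `examples` and return a list of human-readable problems."""
    problems = []
    votes = 0
    total = 0
    for example in examples:
        total += 1
        try:
            label = lf(example)
        except Exception as exc:  # a zoo LF should never crash on a sample row
            problems.append(f"raised {type(exc).__name__} on {example!r}")
            continue
        if label == ABSTAIN:
            continue
        if label not in valid_labels:
            problems.append(f"returned unknown label {label!r} on {example!r}")
            continue
        votes += 1
    # Coverage: fraction of rows on which the LF cast a valid, non-abstain vote.
    coverage = votes / total if total else 0.0
    if coverage < min_coverage:
        problems.append(f"coverage {coverage:.2f} is below {min_coverage:.2f}")
    return problems


# Hypothetical sample rows and LFs used to exercise the check.
SAMPLE_ROWS = [{"query": "12 Main Street"}, {"query": "best pizza near me"}]


def lf_street(x):
    return ADDRESS if "street" in x["query"].lower() else ABSTAIN


def lf_bad_label(x):
    return 7  # not a label the task defines


print(check_labeling_function(lf_street, SAMPLE_ROWS, {ADDRESS}))
print(check_labeling_function(lf_bad_label, SAMPLE_ROWS, {ADDRESS}))
```

A check like this could run in CI on every PR, with each contributed LF shipping its own small sample file.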

I hope this can be built to make Snorkel more practical. Feel free to close this issue.

vincentschen commented 4 years ago

Hi @mrbeann @eggie5 @dataframing, I wanted to loop back around to share some scaffolding for the snorkel-zoo that you all alluded to earlier: https://github.com/snorkel-team/snorkel-zoo

Feel free to open a PR with contributions in this repo. Excited to see what LFs you've had in mind!