snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0
5.81k stars 857 forks

Pandas version of crowdsourcing tutorial #1030

Closed by tmerrittsmith 5 years ago

tmerrittsmith commented 6 years ago

I've edited the crowdsourcing tutorial to use pandas instead of Spark. It was useful for me, so it may be useful to others. Do you want it?

ajratner commented 6 years ago

@tmerrittsmith yes!! This would be awesome! We definitely want to keep the spark one, but could have the pandas version as a separate notebook? Thanks!!

tmerrittsmith commented 6 years ago

Cool. Yeah, the spark one is definitely worth keeping - it's like the Ferrari version, where mine is the second-hand hatchback...

I'll tidy up what I've got, but can provide two possibilities:

1) A straight conversion of the crowdsourcing tutorial (exactly the same data and results), just using pandas where spark was used (to be honest, I didn't use pandas that much apart from joining the CSVs at the beginning)

2) Something more similar to the intro tutorial, but for tabular data. We wrote some labelling functions for a UCI repository dataset, and then used snorkel's generative model to resolve the labels.

What's the best way to get them to you, once they're ready?

ajratner commented 6 years ago

Hah :). Both of these sound awesome - whatever you can send, we'll look over! And the best way would be a PR (separate ones for each). Thanks!!

hanzlfs commented 5 years ago

@ajratner is this still ongoing, or does it exist in any PR? Thanks!!

tmerrittsmith commented 5 years ago

Sorry, haven't got round to making the PR - I'll do that today.

ajratner commented 5 years ago

@tmerrittsmith awesome thanks!!

tmerrittsmith commented 5 years ago

I've submitted a pull request to include a notebook where the spark dependency is removed. The other one (tabular data) is not really quite how I want it: @hanzlfs did you specifically want to look at that, or just see the version where spark is removed?

hanzlfs commented 5 years ago

@tmerrittsmith thanks! I think your PR is good enough to get insights from.

ajratner commented 5 years ago

@paroma assigned to you since you're looking at the PRs!

paroma commented 5 years ago

thank you for submitting this PR! (merged with #1048)

gauthamkrishna-g commented 5 years ago

@tmerrittsmith Could you kindly update on the tabular data PR in case you have any leads? Thanks!