snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0
5.81k stars 857 forks

Question: Has anyone used snorkel for tabular numerical data? #803

Closed matt256 closed 6 years ago

matt256 commented 7 years ago

I have a very large sample of tabular data, consisting mainly of numerical fields, where each row is an example I would like to label. Looking through the documentation and examples, I don't see a way to use the tool in this manner, or at least to easily get the data into a usable format. Does anyone know if this can be or has been done? Any thoughts? The "concept" seems similar, but your target audience was text-based labeling. Just wondering if it could be adapted. Thanks! Also, great work. Heard about the package on the O'Reilly Data Show.

ajratner commented 7 years ago

Hi @matt256, thanks for listening and checking Snorkel out! Is this tabular data in a standardized format that's easily machine readable, or is it embedded in text / PDFs / etc.? If the latter, you can check out Fonduer (see this blog post), which will be merged into master as a module soon!

ajratner commented 7 years ago

Might also want to check out http://pages.cs.wisc.edu/~thodrek/, he does some work in the area of structured tabular data that might be of interest!

matt256 commented 7 years ago

Thank you for the responses, @ajratner. It's actually already in a CSV/tabular format. I'm betting there are ways to make it work, though. A lot of possibilities here. The link to Fonduer was quite valuable. Thank you again.
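
For illustration only, here is a minimal sketch of how the labeling-function idea could carry over to numerical CSV rows. It uses plain Python rather than Snorkel's API. The column names (`temperature`, `pressure`), the thresholds, and the sample rows are hypothetical, and the majority vote is a simple stand-in for Snorkel's generative model, not the model itself.

```python
import csv
import io
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical sample; in practice the rows would come from the real CSV file.
SAMPLE_CSV = """temperature,pressure
95.0,1.2
40.0,0.9
85.0,3.5
20.0,0.5
60.0,1.0
"""


def lf_high_temperature(row):
    # Vote POSITIVE for hot readings, otherwise abstain (threshold is assumed).
    return POSITIVE if float(row["temperature"]) > 80.0 else ABSTAIN


def lf_low_temperature(row):
    # Vote NEGATIVE for cool readings, otherwise abstain.
    return NEGATIVE if float(row["temperature"]) < 50.0 else ABSTAIN


def lf_high_pressure(row):
    # Vote POSITIVE for high-pressure readings, otherwise abstain.
    return POSITIVE if float(row["pressure"]) > 3.0 else ABSTAIN


LABELING_FUNCTIONS = [lf_high_temperature, lf_low_temperature, lf_high_pressure]


def majority_vote(votes):
    # Combine noisy votes; abstain when nobody voted or the top two labels tie.
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN
    ranked = counted.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def label_rows(csv_text):
    # Apply every labeling function to every row and combine the votes.
    reader = csv.DictReader(io.StringIO(csv_text))
    return [majority_vote([lf(row) for lf in LABELING_FUNCTIONS]) for row in reader]


weak_labels = label_rows(SAMPLE_CSV)
print(weak_labels)
```

Rows that no function covers, or on which the functions disagree evenly, stay unlabeled (`-1`) rather than receiving a guessed label.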

ajratner commented 7 years ago

Great!

chrismre commented 7 years ago

The Snorkel idea is leveraged in HoloClean (https://arxiv.org/abs/1702.00820). As Alex pointed out, @thodrek is going to release some open-source code too! My guess is that these techniques might be helpful for the type of structured data that you're describing.

thodrek commented 7 years ago

Hi @matt256, please check out our blog post on HoloClean (http://dawn.cs.stanford.edu/2017/05/12/holoclean/). I believe the problem you are describing can be viewed as a data cleaning task. Think of labeling as trying to suggest a correct value for each cell in your data. HoloClean will do this for you. The weak-supervision rules here correspond to a set of integrity constraints over the data. We are actively refactoring the HoloClean code, and it will be released at the end of this month. I will keep you posted.
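
To make the "integrity constraints as weak supervision" idea concrete, here is a minimal sketch in plain Python. It is not HoloClean's API or its inference algorithm: the rows, the column names, and the functional dependency `zip -> city` are hypothetical, and repairing a cell to the majority value within its group is a deliberately naive stand-in for HoloClean's probabilistic repair.

```python
from collections import Counter, defaultdict

# Hypothetical rows; the third one violates the assumed constraint zip -> city.
ROWS = [
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Cicago"},
    {"zip": "53703", "city": "Madison"},
]


def suggest_repairs(rows, lhs, rhs):
    """Return {row_index: suggested_value} for cells violating lhs -> rhs."""
    # Group row indices by the value of the determining column.
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[row[lhs]].append(index)

    repairs = {}
    for indices in groups.values():
        counts = Counter(rows[i][rhs] for i in indices)
        if len(counts) == 1:
            continue  # the constraint holds for this group
        ranked = counts.most_common()
        if ranked[0][1] == ranked[1][1]:
            continue  # tie: no clear majority, so suggest nothing
        majority_value = ranked[0][0]
        for i in indices:
            if rows[i][rhs] != majority_value:
                repairs[i] = majority_value
    return repairs


repairs = suggest_repairs(ROWS, lhs="zip", rhs="city")
print(repairs)
```

Each suggested repair plays the role of a weak label for a single cell: the constraint flags where the data is inconsistent, and the other rows in the group supply the evidence for the suggested value.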

jim-bo commented 6 years ago

I was wondering if your group is still working on the multi-modal problem you mentioned on your website? I'm looking to incorporate some tabular data with my unstructured text to aid in label generation and eventually in the discriminative model itself.

matt256 commented 6 years ago

We ended up delaying that project a bit so we could check out what the released HoloClean code looked like and learn from others' experiences. Your question was pretty timely, though, as we are about to get things rolling again. I noticed that it hadn't been released yet.

But after reading the blog post recommended above, I think thodrek's response was spot on for what we want to do.

@thodrek, do you all still intend to release a version? Looking forward to it, if so. Looks like you all have done some great work.

thodrek commented 6 years ago

@matt256 @jim-bo Hey guys, the HoloClean release will happen very soon. We are done refactoring our initial code. We are now cleaning it up and expect the first release to happen in December. The repo is still in "private" mode, but once released the code will be hosted here: https://github.com/HoloClean

I will keep you up to date :)

matt256 commented 6 years ago

Thank you so much, @thodrek. Looking forward to it.

ajratner commented 6 years ago

Closing for now--will be accessible via the "Q&A" link in README--but feel free to re-open!

thodrek commented 6 years ago

@matt256 @jim-bo I just wanted to let you know that HoloClean has been released. You can find more info here: http://www.holoclean.io. Please do not hesitate to ping me if you have questions.

asstergi commented 5 years ago

@matt256 Did you manage to use Snorkel, Fonduer, or HoloClean for your purposes? I have a similar task and I'm looking for some guidance.

thodrek commented 5 years ago

@asstergi @matt256 HoloClean can be applied to tabular numerical data. Please post an issue here: https://github.com/HoloClean/holoclean, and we will follow up there.

asstergi commented 5 years ago

@thodrek I posted a new issue in HoloClean. Looking forward to your reply.