Closed: @miromannino closed this issue 6 years ago
@miromannino sounds like an awesome use case!
As you point out, Snorkel is currently designed for classifying independent objects, i.e. not for structured prediction such as adjacent / overlapping (and therefore dependent) rectangles in an image. However, (a) our base model for learning the accuracies of the labeling functions is (in principle at least) simple to extend to your kind of structured case, so we may try to do this at some point, or you could if interested!
And (b) you might actually get pretty far just creating a few simple candidates by some grid partitioning of the image, and then treating them independently to start? Would be curious how well that would work at least as a baseline! Naive independent models often work much better than one would expect :)
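To make the grid-partitioning baseline concrete, here is a minimal sketch (plain Python, not the Snorkel API; all names are illustrative): partition the image into a grid of candidate cells, then let each heuristic vote on every cell independently, producing the kind of label matrix Snorkel's generative model consumes.

```python
def grid_candidates(width, height, n_cols, n_rows):
    """Partition an image into (x, y, w, h) cells, one candidate per cell."""
    cell_w, cell_h = width // n_cols, height // n_rows
    return [
        (c * cell_w, r * cell_h, cell_w, cell_h)
        for r in range(n_rows)
        for c in range(n_cols)
    ]

def overlaps(cell, box):
    """True if two (x, y, w, h) rectangles intersect."""
    x, y, w, h = cell
    bx, by, bw, bh = box
    return x < bx + bw and bx < x + w and y < by + bh and by < y + h

def lf_from_detections(detections):
    """Turn one face detector's output boxes into a labeling function:
    vote +1 on a cell that overlaps any detected face, -1 otherwise."""
    return lambda cell: 1 if any(overlaps(cell, b) for b in detections) else -1

# Two heuristics that detected roughly (but not exactly) the same face:
cells = grid_candidates(640, 480, 8, 6)
lf1 = lf_from_detections([(100, 100, 80, 80)])  # heuristic A's boxes
lf2 = lf_from_detections([(110, 95, 70, 90)])   # heuristic B's boxes
label_matrix = [[lf(c) for lf in (lf1, lf2)] for c in cells]
```

Each row of `label_matrix` is one candidate cell with one vote per heuristic, so the cells are treated as independent data points even though the underlying face regions overlap.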
Also, you might find it interesting to check out Coral (http://dawn.cs.stanford.edu/2017/09/14/coral/) @paroma
Hope some of this helps! -Alex
In terms of having image-based candidates, you can also take a look at the babble branch of Snorkel (@bhancock8) where our candidates are specific pairs of bounding boxes in an image. This file is probably relevant (https://github.com/HazyResearch/snorkel/blob/babble/snorkel/models/context.py)
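For a sense of what an image-based candidate can look like, here is a hypothetical sketch (these class names are illustrative, not the actual babble-branch classes) of a candidate that is a pair of bounding boxes in one image, which labeling functions could then score, e.g. via intersection-over-union:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoundingBox:
    """A rectangle in image coordinates."""
    x: int
    y: int
    width: int
    height: int

    def area(self):
        return self.width * self.height

@dataclass(frozen=True)
class BoxPairCandidate:
    """A candidate consisting of two bounding boxes in the same image."""
    image_id: str
    box1: BoundingBox
    box2: BoundingBox

    def iou(self):
        """Intersection-over-union of the two boxes (0.0 if disjoint)."""
        x1 = max(self.box1.x, self.box2.x)
        y1 = max(self.box1.y, self.box2.y)
        x2 = min(self.box1.x + self.box1.width, self.box2.x + self.box2.width)
        y2 = min(self.box1.y + self.box1.height, self.box2.y + self.box2.height)
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        union = self.box1.area() + self.box2.area() - inter
        return inter / union if union else 0.0
```

A labeling function over such candidates is then just a function of the pair, e.g. `lambda c: 1 if c.iou() > 0.5 else 0`.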
The Coral repo isn't open source, but we have a simple tutorial that could be useful; I can give you access if that's helpful!
First of all, thank you for creating Snorkel; it seems like a good and useful tool.
We would like to use Snorkel for something like labeling images. For example, we have two face detection heuristics that can label faces in a picture. These two heuristics are likely to label the same faces, but they could produce (i.e. label) different regions for the same face. At the same time, as you did in your examples, a crowdsourcing study could be conducted to label these images. So for the same face we now have multiple different rectangles representing different regions, even though they should all be around the same area.
Question: Would Snorkel be able to find correlations when the heuristics themselves create (different) candidate data points, rather than, as in your examples, just labeling pre-computed candidates? My understanding is that the candidates are pre-defined, and the heuristics / crowd workers only label them.
Second question: The source code seems to be built around heuristics that only deal with text (i.e. document, sentence, span). It does not seem suited to a candidate that is a rectangle, i.e. structured data containing x, y, width, height, etc. Is there a way to do this that I'm not aware of?
Thank you so much.