snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Error when doing labeling in binary case using external dataset #1551

Closed. stenpiren closed this issue 4 years ago

stenpiren commented 4 years ago

Issue

I noticed that the indices of the spam dataset used in the labeling functions tutorial behave unexpectedly. I saved the dataset to a separate CSV file after noticing that each index lookup returned 4 rows: every index label in the dataset is repeated 4 times, each time with different text. The tutorial does not explain this or why it has to be this way. As a result, indexing the text column by a label does not return a str, as one would expect, but a pandas.core.series.Series.

Code example

If you take df_train and evaluate df_train.text[0], one would expect the result to be pls <WEBSITE> help me get vip gun cross fire al, but instead I get:

0    pls <WEBSITE> help me get vip gun cross fire al
0    Katycat! https://m.facebook.com/profile.php?id...
0                2011- the last year of decent music.
0                    Check out this video on YouTube:
Name: text, dtype: object
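
The output above is consistent with a DataFrame whose index labels repeat, which is what happens when several separately indexed frames are concatenated without resetting the index. The sketch below reproduces that behaviour with a hypothetical stand-in for df_train built from the four rows shown above; how the tutorial actually builds its data is an assumption here, not something confirmed in this thread.

```python
import pandas as pd

# Hypothetical stand-in for df_train: four frames, each with its own
# 0-based index, concatenated without resetting the index.
parts = [
    pd.DataFrame({"text": ["pls <WEBSITE> help me get vip gun cross fire al"]}),
    pd.DataFrame({"text": ["Katycat! https://m.facebook.com/profile.php?id..."]}),
    pd.DataFrame({"text": ["2011- the last year of decent music."]}),
    pd.DataFrame({"text": ["Check out this video on YouTube:"]}),
]
df_train = pd.concat(parts)  # index labels are now 0, 0, 0, 0

# Label-based lookup: returns every row whose index label is 0 (a Series).
by_label = df_train.text[0]

# Position-based lookup: returns the single value in the first row (a str).
by_position = df_train.text.iloc[0]

# Giving each row a unique label makes label-based lookup return a str too.
df_unique = df_train.reset_index(drop=True)
first_text = df_unique.text[0]
```

In other words, df_train.text[0] is a lookup by label, not by position, so repeated labels yield several rows; .iloc or reset_index(drop=True) gives the single-value behaviour one would usually expect.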

So when I try a different dataset, I always get an error: there df_train.text[0] is a string, and applying a labeling function with PandasLFApplier fails every time with TypeError: ("argument of type 'int' is not iterable", 'occured at index 0')
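
One plausible cause of this TypeError, not confirmed in this thread, is a labeling function that runs a substring test on a cell that holds an integer rather than a string. The sketch below uses plain pandas with DataFrame.apply(axis=1) as a stand-in for row-wise application; the function names, label constants, and sample data are hypothetical.

```python
import pandas as pd

ABSTAIN, SPAM = -1, 1  # hypothetical label constants

def lf_unguarded(row):
    # `row` is one DataFrame row (a pandas Series); row["text"] is the cell value.
    # The `in` test raises TypeError when that value is an int.
    return SPAM if "check out" in row["text"] else ABSTAIN

def lf_guarded(row):
    text = row["text"]
    if not isinstance(text, str):
        return ABSTAIN  # abstain on non-string cells instead of failing
    return SPAM if "check out" in text.lower() else ABSTAIN

# Hypothetical external dataset with one non-string value in the text column.
df = pd.DataFrame(
    {"text": ["Check out this video on YouTube:", 0, "2011- the last year of decent music."]}
)

failure = None
try:
    df.apply(lf_unguarded, axis=1)
except TypeError as err:
    failure = err  # raised on the row whose text is the integer 0

labels = df.apply(lf_guarded, axis=1).tolist()
```

Checking the column's contents first (for example with df["text"].map(type).value_counts()) can show whether non-string values are present before any labeling function runs.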

Additional context

Is there a workaround, or was snorkel designed this way? That is, for new datasets, do I have to deliberately repeat the index labels as above, so that index 0 returns 4 rows, and likewise for every other index?
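
On the question of replicating index labels: a function applied row by row receives one row at a time regardless of what the index looks like, so repeated labels should not be required. The sketch below illustrates this with DataFrame.apply(axis=1); that PandasLFApplier behaves the same way is an assumption, and the function name and label constants are hypothetical.

```python
import pandas as pd

ABSTAIN, SPAM = -1, 1  # hypothetical label constants

def lf_contains_website(row):
    # `row` is a single row (a pandas Series), so row.text is one cell value.
    return SPAM if "<WEBSITE>" in row.text else ABSTAIN

texts = [
    "pls <WEBSITE> help me get vip gun cross fire al",
    "2011- the last year of decent music.",
]

df_unique = pd.DataFrame({"text": texts})                  # index labels 0, 1
df_repeated = pd.DataFrame({"text": texts}, index=[0, 0])  # index labels 0, 0

labels_unique = df_unique.apply(lf_contains_website, axis=1).tolist()
labels_repeated = df_repeated.apply(lf_contains_website, axis=1).tolist()
```

Both frames produce the same labels here, which suggests the index layout of the tutorial data is incidental rather than a requirement.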

Wouldn't it be better to explain this in the tutorials? Because the tutorial datasets are loaded through snorkel, if you follow a tutorial directly on an external dataset, things fail and you end up going back and forth trying to figure out what is happening.

stenpiren commented 4 years ago

I will close this, since the tutorial does mention that the object passed to x will be a pandas Series (pandas.core.series.Series). I still find it a bit strange.