I noticed that the spam dataset used in the labeling functions tutorial has unusual indices.
I saved the dataset to a separate csv file and saw that looking up a single index returns 4 rows: each index value is repeated 4 times, each time with different text. The tutorial does not explain this or why it has to be this way. As a result, indexing the text column by one index value does not return a str, as one would expect, but a pandas.core.series.Series.
Code example
If you take df_train and evaluate df_train.text[0], you would expect the result to be pls <WEBSITE> help me get vip gun cross fire al, but instead I get:
0 pls <WEBSITE> help me get vip gun cross fire al
0 Katycat! https://m.facebook.com/profile.php?id...
0 2011- the last year of decent music.
0 Check out this video on YouTube:
Name: text, dtype: object
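A minimal sketch of how an index can end up repeated like this, assuming the tutorial dataset is built by concatenating several per-source tables without resetting the index (this is an assumption about the loader, not something confirmed here). The two small frames and their placeholder rows are hypothetical stand-ins for the real files.

```python
import pandas as pd

# Hypothetical stand-ins for separate source files; each has its own 0-based index.
parts = [
    pd.DataFrame({"text": ["pls <WEBSITE> help me get vip gun cross fire al", "placeholder a"]}),
    pd.DataFrame({"text": ["2011- the last year of decent music.", "placeholder b"]}),
]

# Concatenating without ignore_index=True keeps the original labels: 0, 1, 0, 1.
df_train = pd.concat(parts)

# Label-based lookup: label 0 occurs twice, so the result is a Series, not a str.
dup = df_train.text[0]

# Positional lookup always returns a single value, whatever the labels are.
first_by_position = df_train.text.iloc[0]

# Resetting the index makes every label unique, so text[0] is a single str again.
df_unique = df_train.reset_index(drop=True)
first = df_unique.text[0]
```

With unique labels, df_unique.text[0] and df_train.text.iloc[0] both give the single string for the first row, so there should be no need to replicate index values in a new dataset.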
So when I try a different dataset, where df_train.text[0] is a plain string, applying a labeling function with PandasLFApplier fails every time with TypeError: ("argument of type 'int' is not iterable", 'occured at index 0')
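One possible cause of this TypeError, offered as a guess rather than a diagnosis: the message says an int was used on the right-hand side of an `in` check, which would happen if some value in the text column is not a string. The sketch below uses plain pandas and assumes the labeling function is applied once per row; contains_link and the sample rows are hypothetical, not taken from the tutorial.

```python
import pandas as pd


def contains_link(row) -> int:
    # Hypothetical check in the style of a labeling function:
    # 1 if the text mentions a link, -1 otherwise.
    return 1 if "http" in row.text else -1


# The first row holds an int instead of a string.
df_bad = pd.DataFrame({"text": [0, "see http://example.com"]})

try:
    df_bad.apply(contains_link, axis=1)
    failed = False
except TypeError:
    # "http" in 0 is not a valid membership test, so the first row raises.
    failed = True

# Casting the column to str lets the same check run on every row.
df_ok = df_bad.assign(text=df_bad.text.astype(str))
labels = df_ok.apply(contains_link, axis=1).tolist()
```

If this is what is happening, checking df_train.text.map(type).value_counts() on the external dataset would show whether any non-string values are present.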
Additional context
Is there a workaround, or was snorkel designed this way? That is, for a new dataset, do I have to deliberately duplicate the index values so that, as above, looking up index 0 (or any other index) returns 4 rows?
Wouldn't it be better to explain in the tutorials why it has to be done this way? Because the tutorial datasets are loaded through snorkel, following the tutorial directly on an external dataset fails, and we end up going back and forth trying to figure out what is happening.