Open jeswan opened 4 years ago
Comment by sleepinyourhat Thursday Apr 16, 2020 at 16:16 GMT
That's not corruption, it's a quirk of the original TSV format for SNLI. The training and test/dev sets for SNLI were constructed differently—there's more detail in the readme that comes with the original/main public version of the data.
Comment by Apsod Thursday Apr 16, 2020 at 20:05 GMT
Thanks for the quick reply! I'm not sure I understand. The README for the SNLI dataset (https://nlp.stanford.edu/projects/snli/) only mentions that annotator labels can be left blank, but the original .tsv files still have a consistent number of fields (14, i.e. 13 tab characters).
The .tsv files you get from download_glue_data, on the other hand, have an inconsistent number of fields: train.tsv has lines with 11, 13, 14, and 15 fields, while dev.tsv and test.tsv each have lines with both 14 and 15 fields.
Comment by sleepinyourhat Thursday Apr 16, 2020 at 20:23 GMT
Okay, that is odd.
Comment by sleepinyourhat Thursday Apr 16, 2020 at 20:23 GMT
Also tagging @W4ngatang, who should know more than I do about what happens in these scripts.
Comment by Apsod Friday Apr 17, 2020 at 08:13 GMT
This is a snippet of SNLI/train.tsv after opening it in LibreOffice Calc (with tab as the separator):
index | ... | label1 | gold_label | | | | |
---|---|---|---|---|---|---|---|
1168 | ... | neutral | neutral | | | | |
1169 | ... | contradiction | contradiction | | | | |
1170 | ... | entailment | entailment | | | | |
1171 | ... | contradiction | contradiction | | | | |
1172 | ... | neutral | neutral | neutral | neutral | neutral | |
1173 | ... | entailment | entailment | | | | |
1174 | ... | contradiction | contradiction | | | | |
1175 | ... | entailment | entailment | | | | |
1176 | ... | neutral | neutral | neutral | neutral | neutral | neutral |
1177 | ... | contradiction | contradiction | | | | |
(Note that the trailing empty fields above are not actually blank fields in the file; they have been filled in by LibreOffice. Datum 1171 ends with `...\tcontradiction\tcontradiction\n`, not `...\tcontradiction\tcontradiction\t\t\t\t\n`.)
Counting the number of fields per line yields the following:
file | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|
dev.tsv | 0 | 0 | 0 | 11 | 9832 |
test.tsv | 0 | 0 | 0 | 9 | 9816 |
train.tsv | 510712 | 0 | 6 | 2226 | 36424 |
Since the majority of lines have either 11 or 15 fields, I suppose it is not a matter of stray tab characters in the text data, but rather a matter of missing blank fields in the labels.
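For anyone who wants to reproduce a per-file count like the one above, here is a minimal sketch. It assumes the files sit in a local `SNLI/` directory (a hypothetical path), splits on raw tab characters without applying any CSV quoting rules, and counts the header line like any other line. The helper name `field_count_histogram` is made up for illustration.

```python
from collections import Counter
from pathlib import Path


def field_count_histogram(path):
    """Return a Counter mapping number-of-fields -> number of lines."""
    counts = Counter()
    # newline="" keeps line endings untouched so we strip them ourselves.
    with open(path, encoding="utf-8", newline="") as f:
        for line in f:
            # Split on raw tabs only; no quoting or escaping is interpreted.
            fields = line.rstrip("\r\n").split("\t")
            counts[len(fields)] += 1
    return counts


if __name__ == "__main__":
    # "SNLI" is an assumed download location; adjust as needed.
    for name in ("dev.tsv", "test.tsv", "train.tsv"):
        path = Path("SNLI") / name
        if path.exists():
            print(name, sorted(field_count_histogram(path).items()))
```

Splitting on raw tabs is a deliberate choice here: a CSV parser with quoting enabled may merge or split lines when the text contains quote characters, which would hide the very inconsistency being measured.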
Comment by pyeres Wednesday Apr 29, 2020 at 14:56 GMT
This issue highlights that there are some differences between the original SNLI data hosted here, and the SNLI data hosted here (and downloaded by jiant/scripts/download_glue_data.py). My understanding is that download_glue_data.py just downloads the hosted data, so it's not clear to me where issues regarding this data should be filed. @sleepinyourhat, @W4ngatang.
Comment by W4ngatang Tuesday May 05, 2020 at 01:40 GMT
I'm seeing similar issues with train.tsv but not dev.tsv or test.tsv (using pandas.read_csv, I get 15 columns for the latter two). Very long ago, I think I converted everything to tsv so I could jam more supported tasks into the load_tsv function, but it's probably better just to use the original data format (jsonl) and avoid these issues altogether. If that sounds good to everyone, I can send a PR for that.
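To illustrate what reading the jsonl format could look like, here is a rough sketch (not the jiant implementation). It assumes each line of the file is one JSON object with `sentence1`, `sentence2`, and `gold_label` keys, matching the field names in the original SNLI distribution; the function name `read_snli_jsonl` is hypothetical.

```python
import json


def read_snli_jsonl(path):
    """Yield (sentence1, sentence2, gold_label) triples from a .jsonl file.

    Assumes one JSON object per line with the keys used below.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            yield record["sentence1"], record["sentence2"], record["gold_label"]
```

Because every record carries its own keys, a missing annotator label cannot shift the position of the remaining fields the way it can in a tab-separated line.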
Issue by Apsod Thursday Apr 16, 2020 at 16:13 GMT Originally opened as https://github.com/nyu-mll/jiant/issues/1073
After running the download_glue_data script, the train.tsv file has an inconsistent number of fields:
e.g. line 1171 has 11 fields:
'1171\t1974336555.jpg#0\t1974336555.jpg#0r1c\t( ( ( A ( young man ) ) ( in ( his ( mid twenties ) ) ) ) ( ( is ( ( kicking ( his ( left foot ) ) ) ( ( ( about two ) feet ) ( off ( ( the leaf ) ( ( covered ( ( ground , ) ( with ( paved ( asphalt ( and ( green ( ( plants and ) trees ) ) ) ) ) ) ) ) ( in ( the background ) ) ) ) ) ) ) ) . ) )\t( ( two women ) ( play soccer ) )\t(ROOT (S (NP (NP (DT A) (JJ young) (NN man)) (PP (IN in) (NP (PRP$ his) (JJ mid) (NNS twenties)))) (VP (VBZ is) (VP (VBG kicking) (NP (PRP$ his) (JJ left) (NN foot)) (PP (NP (QP (RB about) (CD two)) (NNS feet)) (IN off) (NP (NP (DT the) (NN leaf)) (VP (VBN covered) (NP (NP (NN ground)) (, ,) (PP (IN with) (NP (JJ paved) (NN asphalt) (CC and) (JJ green) (NNS plants) (CC and) (NNS trees)))) (PP (IN in) (NP (DT the) (NN background)))))))) (. .)))\t(ROOT (NP (NP (CD two) (NNS women)) (NP (NN play) (NN soccer))))\tA young man in his mid twenties is kicking his left foot about two feet off the leaf covered ground, with paved asphalt and green plants and trees in the background.\ttwo women play soccer\tcontradiction\tcontradiction\n'
line 1172 has 15 fields:'1172\t1974336555.jpg#0\t1974336555.jpg#0r1n\t( ( ( A ( young man ) ) ( in ( his ( mid twenties ) ) ) ) ( ( is ( ( kicking ( his ( left foot ) ) ) ( ( ( about two ) feet ) ( off ( ( the leaf ) ( ( covered ( ( ground , ) ( with ( paved ( asphalt ( and ( green ( ( plants and ) trees ) ) ) ) ) ) ) ) ( in ( the background ) ) ) ) ) ) ) ) . ) )\t( ( a punk ) ( kicks leaves ) )\t(ROOT (S (NP (NP (DT A) (JJ young) (NN man)) (PP (IN in) (NP (PRP$ his) (JJ mid) (NNS twenties)))) (VP (VBZ is) (VP (VBG kicking) (NP (PRP$ his) (JJ left) (NN foot)) (PP (NP (QP (RB about) (CD two)) (NNS feet)) (IN off) (NP (NP (DT the) (NN leaf)) (VP (VBN covered) (NP (NP (NN ground)) (, ,) (PP (IN with) (NP (JJ paved) (NN asphalt) (CC and) (JJ green) (NNS plants) (CC and) (NNS trees)))) (PP (IN in) (NP (DT the) (NN background)))))))) (. .)))\t(ROOT (S (NP (DT a) (NN punk)) (VP (VBZ kicks) (NP (NNS leaves)))))\tA young man in his mid twenties is kicking his left foot about two feet off the leaf covered ground, with paved asphalt and green plants and trees in the background.\ta punk kicks leaves\tneutral\t\tneutral\tneutral\tneutral\tneutral\n'
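As a stopgap for the ragged lines shown above, one possible workaround is to read only the fields whose positions look stable. The sketch below rests on assumptions drawn solely from the two example lines: that the first nine fields are always present in the same order (so `sentence1` and `sentence2` are at indices 7 and 8), and that the last field is the gold label. Neither assumption has been checked against the 13- and 14-field lines, and the function name `parse_ragged_snli_line` is made up.

```python
def parse_ragged_snli_line(line):
    """Extract a few fields from a tab-separated SNLI line of varying length.

    ASSUMPTIONS (unverified beyond the two example lines):
      - fields 0..8 are always present, in a fixed order
      - the final field is the gold label
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    return {
        "index": fields[0],
        "sentence1": fields[7],
        "sentence2": fields[8],
        "gold_label": fields[-1],
    }
```

Applied to the two lines above, this should give `contradiction` for line 1171 and `neutral` for line 1172, but it is only a sketch and silently ignores the individual annotator labels.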