nyu-mll / jiant-v1-legacy

The jiant toolkit for general-purpose text understanding models
MIT License

Potentially corrupts SNLI data #1073

Open jeswan opened 4 years ago

jeswan commented 4 years ago

Issue by Apsod Thursday Apr 16, 2020 at 16:13 GMT Originally opened as https://github.com/nyu-mll/jiant/issues/1073


After running the download_glue_data script, the train.tsv file has an inconsistent number of fields:

For example, line 1171 has 11 fields:

'1171\t1974336555.jpg#0\t1974336555.jpg#0r1c\t( ( ( A ( young man ) ) ( in ( his ( mid twenties ) ) ) ) ( ( is ( ( kicking ( his ( left foot ) ) ) ( ( ( about two ) feet ) ( off ( ( the leaf ) ( ( covered ( ( ground , ) ( with ( paved ( asphalt ( and ( green ( ( plants and ) trees ) ) ) ) ) ) ) ) ( in ( the background ) ) ) ) ) ) ) ) . ) )\t( ( two women ) ( play soccer ) )\t(ROOT (S (NP (NP (DT A) (JJ young) (NN man)) (PP (IN in) (NP (PRP$ his) (JJ mid) (NNS twenties)))) (VP (VBZ is) (VP (VBG kicking) (NP (PRP$ his) (JJ left) (NN foot)) (PP (NP (QP (RB about) (CD two)) (NNS feet)) (IN off) (NP (NP (DT the) (NN leaf)) (VP (VBN covered) (NP (NP (NN ground)) (, ,) (PP (IN with) (NP (JJ paved) (NN asphalt) (CC and) (JJ green) (NNS plants) (CC and) (NNS trees)))) (PP (IN in) (NP (DT the) (NN background)))))))) (. .)))\t(ROOT (NP (NP (CD two) (NNS women)) (NP (NN play) (NN soccer))))\tA young man in his mid twenties is kicking his left foot about two feet off the leaf covered ground, with paved asphalt and green plants and trees in the background.\ttwo women play soccer\tcontradiction\tcontradiction\n'

Line 1172 has 15 fields:

'1172\t1974336555.jpg#0\t1974336555.jpg#0r1n\t( ( ( A ( young man ) ) ( in ( his ( mid twenties ) ) ) ) ( ( is ( ( kicking ( his ( left foot ) ) ) ( ( ( about two ) feet ) ( off ( ( the leaf ) ( ( covered ( ( ground , ) ( with ( paved ( asphalt ( and ( green ( ( plants and ) trees ) ) ) ) ) ) ) ) ( in ( the background ) ) ) ) ) ) ) ) . ) )\t( ( a punk ) ( kicks leaves ) )\t(ROOT (S (NP (NP (DT A) (JJ young) (NN man)) (PP (IN in) (NP (PRP$ his) (JJ mid) (NNS twenties)))) (VP (VBZ is) (VP (VBG kicking) (NP (PRP$ his) (JJ left) (NN foot)) (PP (NP (QP (RB about) (CD two)) (NNS feet)) (IN off) (NP (NP (DT the) (NN leaf)) (VP (VBN covered) (NP (NP (NN ground)) (, ,) (PP (IN with) (NP (JJ paved) (NN asphalt) (CC and) (JJ green) (NNS plants) (CC and) (NNS trees)))) (PP (IN in) (NP (DT the) (NN background)))))))) (. .)))\t(ROOT (S (NP (DT a) (NN punk)) (VP (VBZ kicks) (NP (NNS leaves)))))\tA young man in his mid twenties is kicking his left foot about two feet off the leaf covered ground, with paved asphalt and green plants and trees in the background.\ta punk kicks leaves\tneutral\t\tneutral\tneutral\tneutral\tneutral\n'

jeswan commented 4 years ago

Comment by sleepinyourhat Thursday Apr 16, 2020 at 16:16 GMT


That's not corruption, it's a quirk of the original TSV format for SNLI. The training and test/dev sets for SNLI were constructed differently—there's more detail in the readme that comes with the original/main public version of the data.

jeswan commented 4 years ago

Comment by Apsod Thursday Apr 16, 2020 at 20:05 GMT


Thanks for the quick reply! I'm not sure I understand. The README for the SNLI dataset (https://nlp.stanford.edu/projects/snli/) only mentions that annotator labels can be left blank, but the original .tsv files still have a consistent number of fields (14, i.e. 13 tab characters).

The .tsv files you get from download_glue_data, on the other hand, have an inconsistent number of fields: train.tsv has lines with 11, 13, 14, and 15 fields, while dev.tsv and test.tsv each have lines with both 14 and 15 fields.

jeswan commented 4 years ago

Comment by sleepinyourhat Thursday Apr 16, 2020 at 20:23 GMT


Okay, that is odd.

jeswan commented 4 years ago

Comment by sleepinyourhat Thursday Apr 16, 2020 at 20:23 GMT


(Thanks for flagging!)

jeswan commented 4 years ago

Comment by sleepinyourhat Thursday Apr 16, 2020 at 20:23 GMT


Also tagging @W4ngatang, who should know more than I do about what happens in these scripts.

jeswan commented 4 years ago

Comment by Apsod Friday Apr 17, 2020 at 08:13 GMT


This is a snippet of SNLI/train.tsv after opening it in LibreOffice Calc (separated by tab):

index  ...  label1         gold_label
1168   ...  neutral        neutral
1169   ...  contradiction  contradiction
1170   ...  entailment     entailment
1171   ...  contradiction  contradiction
1172   ...  neutral                       neutral  neutral  neutral  neutral
1173   ...  entailment     entailment
1174   ...  contradiction  contradiction
1175   ...  entailment     entailment
1176   ...  neutral        neutral        neutral  neutral  neutral  neutral
1177   ...  contradiction  contradiction

(Note that the trailing empty fields above are not actually blank fields in the file; they have been filled in by LibreOffice. Datum 1171 ends with ...\tcontradiction\tcontradiction\n, not ...\tcontradiction\tcontradiction\t\t\t\t\n.)
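One way to check this without a spreadsheet is to split the raw line on tabs, which keeps empty fields as they are instead of filling them in. A minimal sketch: the helper name `split_fields` and the shortened sample lines are made up for illustration and are not part of jiant.

```python
def split_fields(raw_line):
    """Split one raw TSV line on tabs, keeping empty fields."""
    # Strip only the line terminator; str.split("\t") preserves empty strings,
    # so trailing tabs show up as trailing '' entries.
    return raw_line.rstrip("\r\n").split("\t")


# Shortened, hypothetical stand-ins for the two line endings quoted above.
unpadded = "1171\tcontradiction\tcontradiction\n"
padded = "1171\tcontradiction\tcontradiction\t\t\t\t\n"

print(len(split_fields(unpadded)))  # 3
print(len(split_fields(padded)))    # 7
```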

Counting the number of fields per line yields the following:

file           11   12   13     14      15
dev.tsv         0    0    0     11    9832
test.tsv        0    0    0      9    9816
train.tsv  510712    0    6   2226   36424

Since the majority of lines have either 11 or 15 fields, I suppose it is not a matter of stray tab characters in the text data, but rather a matter of missing blank fields in the labels.
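A count like the one in the table above could be produced roughly as follows. This is a sketch, not necessarily the script that was used: the function name `field_count_histogram` is hypothetical, and it assumes the files are UTF-8 and that every line, including the header, is counted.

```python
from collections import Counter


def field_count_histogram(path):
    """Map number of tab-separated fields -> number of lines in the file."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Strip only the line terminator so trailing empty fields still count.
            counts[len(line.rstrip("\r\n").split("\t"))] += 1
    return counts


# Example call (hypothetical path):
#   print(sorted(field_count_histogram("SNLI/train.tsv").items()))
```

Splitting on the raw tab character, rather than going through a CSV parser, avoids any quoting or padding behavior that might hide the inconsistency.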

jeswan commented 4 years ago

Comment by pyeres Wednesday Apr 29, 2020 at 14:56 GMT


This issue highlights that there are some differences between the original SNLI data and the hosted copy of SNLI that jiant/scripts/download_glue_data.py downloads. My understanding is that download_glue_data.py just downloads the hosted data, so it's not clear to me where issues regarding this data should be filed. @sleepinyourhat, @W4ngatang.

jeswan commented 4 years ago

Comment by W4ngatang Tuesday May 05, 2020 at 01:40 GMT


I'm seeing similar issues with train.tsv but not dev.tsv or test.tsv (using pandas.read_csv, 15 columns for the latter two). A long time ago, I think I converted everything to tsv so I could jam more supported tasks into the load_tsv function, but it's probably better just to use the original data format (jsonl) and avoid these issues altogether. If that sounds good to everyone, I can send a PR for that.
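For reference, reading the JSON Lines release directly could look like the sketch below. This is not the code from any jiant PR: the function name `load_snli_jsonl` is hypothetical, and the keys `sentence1`, `sentence2`, and `gold_label` are the ones used in the SNLI 1.0 .jsonl files. Skipping pairs whose `gold_label` is `-` (no annotator consensus) is an assumption made here, not something taken from jiant.

```python
import json


def load_snli_jsonl(path):
    """Read SNLI-style JSON Lines into (premise, hypothesis, label) tuples."""
    examples = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            label = record["gold_label"]
            if label == "-":
                continue  # no annotator consensus; assumed to be skipped
            examples.append((record["sentence1"], record["sentence2"], label))
    return examples
```

Because each record is a JSON object with named keys, a missing annotator label cannot shift the remaining fields out of position the way a missing tab does in a TSV row.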

jeswan commented 4 years ago

Comment by sleepinyourhat Tuesday May 05, 2020 at 15:46 GMT


Please do!

jeswan commented 4 years ago

Comment by W4ngatang Thursday May 21, 2020 at 04:27 GMT


@Apsod, this should be resolved by #1086, if it is still an issue for you.