sileod / tasksource

Dataset collection and preprocessing framework for NLP extreme multitask learning
Apache License 2.0

super_glue/multirc is bugged #1

Closed A1exRey closed 1 year ago

A1exRey commented 1 year ago

Hi, thanks for the great collection of datasets. It seems, however, that not all datasets in it are preprocessed correctly. MultiRC requires the paragraph, the question, and each candidate answer to be concatenated together for classification, but in your case you take only the question itself without adding the rest of the data. In tasks.py:

super_glue___multirc = Classification(sentence1="question", labels="label")

And during load we get:

from tasksource import list_tasks, load_task
ddf = load_task('super_glue/multirc')
index | sentence1 | labels
-- | -- | --
0 | What did the high-level effort to persuade Pakistan include? | 0
1 | What did the high-level effort to persuade Pakistan include? | 0
2 | What did the high-level effort to persuade Pakistan include? | 1
3 | What did the high-level effort to persuade Pakistan include? | 1
4 | What did the high-level effort to persuade Pakistan include? | 1

This data does not make any sense, and no model will learn anything from it. Maybe you should replace the code with something like the following to put all the data together (following the WiC example):

super_glue___multirc = Classification(
    sentence1=cat(["paragraph", "question", "answer"], " : "),
    labels="label"
)
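
For comparison, here is what the intended input looks like when the fields are joined directly with the datasets library (a minimal sketch, not tasksource code; paragraph, question, answer and label are the field names of the Hugging Face super_glue/multirc config):

from datasets import load_dataset

# Load the raw MultiRC split and build the concatenated classification input,
# joining paragraph, question and answer with " : " as in the proposed fix.
ds = load_dataset("super_glue", "multirc", split="validation")

def add_pair(example):
    example["sentence1"] = " : ".join(
        [example["paragraph"], example["question"], example["answer"]]
    )
    return example

ds = ds.map(add_pair)
print(ds[0]["sentence1"][:200], ds[0]["label"])

Each row then carries the full paragraph and the candidate answer, so the label is actually predictable from sentence1.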
sileod commented 1 year ago

I apologize for that mistake. I manually check the processed datasets (and I also trained models on them), but there might be some errors I overlooked. The latest release fixes this one. Thanks a lot for your input.
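
For anyone landing here later, a quick way to confirm the fix after upgrading is to reload the task and check that sentence1 now contains the paragraph and the candidate answer rather than the bare question (a hedged sketch; it assumes load_task returns splits indexable like Hugging Face datasets, with a train split):

from tasksource import load_task

# Sanity check after upgrading tasksource: sentence1 should now hold the
# concatenated paragraph/question/answer, not just the repeated question.
# (Assumes a DatasetDict-like return value with a 'train' split.)
ddf = load_task('super_glue/multirc')
example = ddf['train'][0]
print(example['sentence1'])
print(example['labels'])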