Closed pavlion closed 3 years ago
Hi there,
the `run_mlm` script expects an unlabeled text dataset, i.e. a dataset with a `text` column. If there is no `text` column, it assumes that the first column is the text column.
Here `sst2` is a classification dataset and the first column is `idx`.
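For illustration, the column selection described above can be sketched roughly as follows. The helper name `pick_text_column` is hypothetical, and the column lists are assumptions for the example (only `idx` being the first `sst2` column comes from this thread); this is a sketch of the behavior, not the script's actual code.

```python
from typing import List


def pick_text_column(column_names: List[str]) -> str:
    """Hypothetical helper mirroring the fallback described in this thread.

    Prefer a column literally named "text"; otherwise fall back to the
    first column, whatever it happens to be.
    """
    return "text" if "text" in column_names else column_names[0]


# Assumed column layouts, for illustration only.
plain_text_columns = ["text"]
sst2_columns = ["idx", "label", "sentence"]  # "idx" first, as stated above

print(pick_text_column(plain_text_columns))  # text
print(pick_text_column(sst2_columns))        # idx
```

With a layout like the second one, the fallback picks `idx`, so the script would try to tokenize integer indices instead of the sentences.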
So you could change the script and directly hardcode the `text_column_name`, which is `sentence` for `sst2`.
https://github.com/huggingface/transformers/blob/b013842244df7be96b8cc841491bd1e35e475e36/examples/language-modeling/run_mlm.py#L304
And also pass the `--line_by_line` argument.
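A minimal sketch of that kind of change, assuming the column lookup is wrapped in a hypothetical helper `resolve_text_column` with an optional override. In the script itself you would more likely just assign the column name directly at the linked line; the column list here is again an assumption.

```python
from typing import List, Optional


def resolve_text_column(column_names: List[str], override: Optional[str] = None) -> str:
    """Hypothetical helper: use an explicit column name if one is given.

    Without an override, fall back to the behavior described above
    ("text" if present, else the first column).
    """
    if override is not None:
        if override not in column_names:
            raise ValueError(
                f"Column {override!r} not found; available columns: {column_names}"
            )
        return override
    return "text" if "text" in column_names else column_names[0]


# Assumed sst2 column layout, for illustration only.
sst2_columns = ["idx", "label", "sentence"]

# Hardcoding the text column for sst2, as suggested above.
text_column_name = resolve_text_column(sst2_columns, override="sentence")
print(text_column_name)  # sentence
```

Checking that the override exists in the dataset gives a clearer failure than a tokenization error further down the pipeline.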
Thanks for your prompt and accurate reply. This does help!
But I notice that the `line_by_line` argument is not needed as long as the `text_column_name` is hard-coded in that line.
Anyways, thanks!
Environment info
`transformers` version: 4.3.3
Who can help
Information
Model I am using (Bert, XLNet ...): albert-xlarge-v2
The problem arises when using: transformers/examples/language-modeling/run_mlm.py (https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_mlm.py)
The task I am working on is: continued pre-training on a specific dataset (glue/sst2)
To reproduce
Steps to reproduce the behavior:
Error message
I skipped the message that relates to the model, as there's no problem with the loading of the model. Dataset is successfully downloaded. There's a warning above the assertion error
Expected behavior
When I continue pre-training on other datasets such as 'ag_news', 'dbpedia_14', 'imdb', there's no error and everything is fine. There is also no "dataset_config_name" for these three datasets. However, there's also no error when I use `dataset_name=wikitext` and `dataset_config_name=wikitext-2-raw-v1` in `run_mlm.py`.
Judging from the error message above, it seems that the data format of SST-2 is wrong, so the `datasets` library cannot handle the data correctly. Any suggestions are highly appreciated!