huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Continue pre-training using the example code "run_mlm.py" #10474

Closed pavlion closed 3 years ago

pavlion commented 3 years ago

Environment info

Who can help

Information

Model I am using (Bert, XLNet ...): albert-xlarge-v2

The problem arises when using: transformers/examples/language-modeling/run_mlm.py (https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_mlm.py)

The task I am working on is: continue pre-training on a specific dataset (glue/sst2)

To reproduce

Steps to reproduce the behavior:

CUDA_VISIBLE_DEVICES=0 python run_mlm.py \
    --model_name_or_path albert-xlarge-v2 \
    --dataset_name "glue" \
    --dataset_config_name "sst2" \
    --do_train \
    --do_eval \
    --output_dir ckpt/pre_training/glue

Error message

Traceback (most recent call last):
  File "src/run_mlm.py", line 447, in <module>
    main()
  File "src/run_mlm.py", line 353, in main
    tokenized_datasets = datasets.map(
  File "/home/robotlab/anaconda3/envs/thesis/lib/python3.8/site-packages/datasets/dataset_dict.py", line 369, in map
    {
  File "/home/robotlab/anaconda3/envs/thesis/lib/python3.8/site-packages/datasets/dataset_dict.py", line 370, in <dictcomp>
    k: dataset.map(
  File "/home/robotlab/anaconda3/envs/thesis/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1120, in map
    update_data = does_function_return_dict(test_inputs, test_indices)
  File "/home/robotlab/anaconda3/envs/thesis/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1091, in does_function_return_dict
    function(*fn_args, indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "src/run_mlm.py", line 351, in tokenize_function
    return tokenizer(examples[text_column_name], return_special_tokens_mask=True)
  File "/home/robotlab/anaconda3/envs/thesis/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2286, in __call__
    assert isinstance(text, str) or (
AssertionError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

I skipped the messages related to loading the model, since the model loads without any problem, and the dataset is downloaded successfully. There's a warning above the assertion error:

03/02/2021 14:03:44 - WARNING - datasets.builder -   Reusing dataset glue (/home/robotlab/.cache/huggingface/datasets/glue/sst2/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)

Expected behavior

When I continue pre-training on other datasets such as 'ag_news', 'dbpedia_14', or 'imdb', there's no error and everything is fine. These three datasets also have no "dataset_config_name". However, there's also no error when I use dataset_name=wikitext and dataset_config_name=wikitext-2-raw-v1 in run_mlm.py.

Judging from the error message above, it seems the data format of SST-2 is such that the datasets library cannot handle the data correctly. Any suggestion is highly appreciated!
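A minimal stdlib-only sketch of what seems to go wrong, assuming sst2's first column is `idx` (integers) rather than text: run_mlm.py picks the `text` column if present and otherwise falls back to the first column, and the tokenizer then rejects the non-string input. `pick_text_column` and `toy_tokenizer` are hypothetical stand-ins, not the real script or tokenizer.

```python
def pick_text_column(column_names):
    # Paraphrase of run_mlm.py's fallback: prefer "text", else the first column.
    return "text" if "text" in column_names else column_names[0]

def toy_tokenizer(text):
    # Mimics the transformers assertion: only str / List[str] is accepted.
    ok = isinstance(text, str) or (
        isinstance(text, (list, tuple)) and all(isinstance(t, str) for t in text)
    )
    if not ok:
        raise AssertionError(
            "text input must of type `str`, `List[str]` or `List[List[str]]`"
        )
    texts = [text] if isinstance(text, str) else text
    return [t.split() for t in texts]

# sst2-like batch: the first column is "idx" (ints), so the fallback picks it.
batch = {"idx": [0, 1], "sentence": ["a fine film", "dull and slow"], "label": [1, 0]}
col = pick_text_column(list(batch.keys()))  # -> "idx"
try:
    toy_tokenizer(batch[col])  # a list of ints, not strings
except AssertionError as e:
    print("AssertionError:", e)
```

Datasets that do have a `text` column (e.g. ag_news, imdb) never hit the fallback, which would explain why they work unmodified.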

patil-suraj commented 3 years ago

Hi there,

the run_mlm script expects an unlabeled text dataset, i.e. a dataset with a text column in it; if there is no text column, it assumes that the first column is the text column.

Here sst2 is a classification dataset, and its first column is idx. So you could change the script and directly hardcode the text_column_name, which is sentence for sst2: https://github.com/huggingface/transformers/blob/b013842244df7be96b8cc841491bd1e35e475e36/examples/language-modeling/run_mlm.py#L304

And also pass the --line_by_line argument.
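A short sketch of the suggested hardcoding, with a toy sst2-like layout standing in for the real dataset (column names here are assumptions paraphrasing the script, not a verbatim patch):

```python
# run_mlm.py derives text_column_name roughly like this (paraphrased):
column_names = ["idx", "sentence", "label"]  # sst2-like layout
text_column_name = "text" if "text" in column_names else column_names[0]
print(text_column_name)  # "idx" -- the wrong column for MLM

# Suggested fix: hardcode the text column for sst2 instead.
text_column_name = "sentence"

batch = {"idx": [0, 1], "sentence": ["a fine film", "dull and slow"]}
texts = batch[text_column_name]
assert all(isinstance(t, str) for t in texts)  # now valid tokenizer input
```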

pavlion commented 3 years ago

Thanks for your prompt and accurate reply. This does help!

But I noticed that the line_by_line argument is not needed as long as the text_column_name is hard-coded in that line. Anyway, thanks!
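A related workaround, not from this thread but worth noting: rename sst2's sentence column to text before tokenization, so the script's default column choice finds it without editing run_mlm.py. On a real dataset this would be `datasets.Dataset.rename_column("sentence", "text")`; the sketch below uses a plain dict as a stand-in.

```python
# Plain-dict stand-in for an sst2-like dataset batch.
batch = {"idx": [0, 1], "sentence": ["a fine film", "dull and slow"], "label": [1, 0]}

# Rename "sentence" -> "text" (the real call would be Dataset.rename_column).
batch["text"] = batch.pop("sentence")

# The script's default selection now resolves to the renamed column.
text_column_name = "text" if "text" in batch else list(batch)[0]
print(text_column_name)  # "text"
```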