huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Continue pre-training using the example code "run_mlm.py" #10474

Closed pavlion closed 3 years ago

pavlion commented 3 years ago

Environment info

Who can help

Information

Model I am using (Bert, XLNet ...): albert-xlarge-v2

The problem arises when using: transformers/examples/language-modeling/run_mlm.py (https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_mlm.py)

The task I am working on is: continue pre-training on a specific dataset (glue/sst2)

To reproduce

Steps to reproduce the behavior:

CUDA_VISIBLE_DEVICES=0 python run_mlm.py \
    --model_name_or_path albert-xlarge-v2 \
    --dataset_name "glue" \
    --dataset_config_name "sst2" \
    --do_train \
    --do_eval \
    --output_dir ckpt/pre_training/glue

Error message

Traceback (most recent call last):
  File "src/run_mlm.py", line 447, in <module>
    main()
  File "src/run_mlm.py", line 353, in main
    tokenized_datasets = datasets.map(
  File "/home/robotlab/anaconda3/envs/thesis/lib/python3.8/site-packages/datasets/dataset_dict.py", line 369, in map
    {
  File "/home/robotlab/anaconda3/envs/thesis/lib/python3.8/site-packages/datasets/dataset_dict.py", line 370, in <dictcomp>
    k: dataset.map(
  File "/home/robotlab/anaconda3/envs/thesis/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1120, in map
    update_data = does_function_return_dict(test_inputs, test_indices)
  File "/home/robotlab/anaconda3/envs/thesis/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1091, in does_function_return_dict
    function(*fn_args, indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "src/run_mlm.py", line 351, in tokenize_function
    return tokenizer(examples[text_column_name], return_special_tokens_mask=True)
  File "/home/robotlab/anaconda3/envs/thesis/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2286, in __call__
    assert isinstance(text, str) or (
AssertionError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

I skipped the messages related to loading the model, since the model loads without any problem, and the dataset is downloaded successfully. There's a warning above the assertion error:

03/02/2021 14:03:44 - WARNING - datasets.builder -   Reusing dataset glue (/home/robotlab/.cache/huggingface/datasets/glue/sst2/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)

Expected behavior

When I continue pre-training on other datasets such as 'ag_news', 'dbpedia_14', or 'imdb', there's no error and everything is fine. These three datasets also have no "dataset_config_name". However, there's also no error when I use dataset_name=wikitext and dataset_config_name=wikitext-2-raw-v1 in run_mlm.py.

Judging from the error message above, it seems the data format of SST-2 is such that the datasets library cannot handle the data correctly. Any suggestion is highly appreciated!
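A minimal stdlib-only sketch of what seems to go wrong, assuming sst2's first column is `idx` (integers) rather than text: run_mlm.py picks the `text` column if present and otherwise falls back to the first column, and the tokenizer then rejects the non-string input. `pick_text_column` and `toy_tokenizer` are hypothetical stand-ins, not the real script or tokenizer.

```python
def pick_text_column(column_names):
    # Paraphrase of run_mlm.py's fallback: prefer "text", else the first column.
    return "text" if "text" in column_names else column_names[0]

def toy_tokenizer(text):
    # Mimics the transformers assertion: only str / List[str] is accepted.
    ok = isinstance(text, str) or (
        isinstance(text, (list, tuple)) and all(isinstance(t, str) for t in text)
    )
    if not ok:
        raise AssertionError(
            "text input must of type `str`, `List[str]` or `List[List[str]]`"
        )
    texts = [text] if isinstance(text, str) else text
    return [t.split() for t in texts]

# sst2-like batch: the first column is "idx" (ints), so the fallback picks it.
batch = {"idx": [0, 1], "sentence": ["a fine film", "dull and slow"], "label": [1, 0]}
col = pick_text_column(list(batch.keys()))  # -> "idx"
try:
    toy_tokenizer(batch[col])  # a list of ints, not strings
except AssertionError as e:
    print("AssertionError:", e)
```

Datasets that do have a `text` column (e.g. ag_news, imdb) never hit the fallback, which would explain why they work unmodified.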

patil-suraj commented 3 years ago

Hi there,

the run_mlm script expects an unlabeled text dataset, i.e. a dataset with a text column in it; if there is no text column, it assumes that the first column is the text column.

Here sst2 is a classification dataset, and its first column is idx. So you could change the script and directly hardcode the text_column_name, which is sentence for sst2: https://github.com/huggingface/transformers/blob/b013842244df7be96b8cc841491bd1e35e475e36/examples/language-modeling/run_mlm.py#L304

And also pass the --line_by_line argument.
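A short sketch of the suggested hardcoding, with a toy sst2-like layout standing in for the real dataset (column names here are assumptions paraphrasing the script, not a verbatim patch):

```python
# run_mlm.py derives text_column_name roughly like this (paraphrased):
column_names = ["idx", "sentence", "label"]  # sst2-like layout
text_column_name = "text" if "text" in column_names else column_names[0]
print(text_column_name)  # "idx" -- the wrong column for MLM

# Suggested fix: hardcode the text column for sst2 instead.
text_column_name = "sentence"

batch = {"idx": [0, 1], "sentence": ["a fine film", "dull and slow"]}
texts = batch[text_column_name]
assert all(isinstance(t, str) for t in texts)  # now valid tokenizer input
```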

pavlion commented 3 years ago

Thanks for your prompt and accurate reply. This does help!

But I noticed that the line_by_line argument is not needed as long as the text_column_name is hard-coded in that line. Anyway, thanks!
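A related workaround, not from this thread but worth noting: rename sst2's sentence column to text before tokenization, so the script's default column choice finds it without editing run_mlm.py. On a real dataset this would be `datasets.Dataset.rename_column("sentence", "text")`; the sketch below uses a plain dict as a stand-in.

```python
# Plain-dict stand-in for an sst2-like dataset batch.
batch = {"idx": [0, 1], "sentence": ["a fine film", "dull and slow"], "label": [1, 0]}

# Rename "sentence" -> "text" (the real call would be Dataset.rename_column).
batch["text"] = batch.pop("sentence")

# The script's default selection now resolves to the renamed column.
text_column_name = "text" if "text" in batch else list(batch)[0]
print(text_column_name)  # "text"
```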