Closed cyk1337 closed 2 years ago
cc @sgugger
Which command are you running exactly? The logs you produce use distributed training whereas the command you told us (which runs successfully on my side) launches the script with python.
I just reran it on another machine and got the same issue.
The exact command is:
$ python run_mlm_no_trainer.py --model_name_or_path=./roberta-base --dataset_name=wikitext --dataset_config_name=wikitext-2-raw-v1 --output_dir=./test_mlm_out
where the ./roberta-base directory contains:
$ ls roberta-base/
config.json merges.txt pytorch_model.bin vocab.json
The output was:
01/11/2022 11:59:36 - INFO - __main__ - ***** Running training *****
01/11/2022 11:59:36 - INFO - __main__ - Num examples = 2390
01/11/2022 11:59:36 - INFO - __main__ - Num Epochs = 3
01/11/2022 11:59:36 - INFO - __main__ - Instantaneous batch size per device = 8
01/11/2022 11:59:36 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 8
01/11/2022 11:59:36 - INFO - __main__ - Gradient Accumulation steps = 1
01/11/2022 11:59:36 - INFO - __main__ - Total optimization steps = 897
0%| | 0/897 [00:00<?, ?it/s]Traceback (most recent call last):
File "run_mlm_no_trainer.py", line 566, in <module>
main()
File "run_mlm_no_trainer.py", line 513, in main
outputs = model(**batch)
File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 1106, in forward
return_dict=return_dict,
File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 817, in forward
buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (1024) must match the existing size (514) at non-singleton dimension 1. Target sizes: [8, 1024]. Tensor sizes: [1, 514]
0%| | 0/897 [00:00<?, ?it/s]
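For context on the error itself: torch.Tensor.expand can only grow dimensions whose current size is 1, so a [1, 514] buffer can broadcast over the batch dimension but can never be stretched from 514 to 1024 along the sequence dimension. A minimal sketch of that rule (can_expand is a hypothetical helper written for illustration, not PyTorch code) shows why the buffered token_type_ids fails here:

```python
def can_expand(tensor_shape, target_shape):
    """Mimic torch.Tensor.expand's rule: aligning shapes from the right,
    every existing dimension must be either 1 or equal to the target."""
    offset = len(target_shape) - len(tensor_shape)
    if offset < 0:  # expand may add leading dims, never drop them
        return False
    for i, size in enumerate(tensor_shape):
        if size != 1 and size != target_shape[offset + i]:
            return False
    return True

# RoBERTa's buffered token_type_ids has shape [1, max_position_embeddings] = [1, 514]
print(can_expand((1, 514), (8, 1024)))  # False: 514 != 1024 at dim 1 -> RuntimeError
print(can_expand((1, 514), (8, 514)))   # True: dim 0 is 1, dim 1 matches
```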
Possible Solution
The reported issue is a last-dimension mismatch between the target size (1024) and the tensor size (514) of token_type_ids. I suspect it is caused by leaving --max_seq_length unspecified. With the additional argument --max_seq_length=512, it works. Is that correct?
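As I understand the script's fallback (a paraphrased sketch, not the script's exact code): when --max_seq_length is omitted, run_mlm_no_trainer.py falls back to tokenizer.model_max_length, capped at 1024. If the local tokenizer files never configured a limit, the tokenizer reports a huge sentinel value, so the fallback lands on 1024, which exceeds RoBERTa's 512 usable positions:

```python
def resolve_max_seq_length(arg_max_seq_length, tokenizer_model_max_length, cap=1024):
    """Paraphrase of the fallback logic in run_mlm_no_trainer.py (assumption,
    simplified): no --max_seq_length -> use the tokenizer's limit, capped."""
    if arg_max_seq_length is None:
        return min(tokenizer_model_max_length, cap)
    return min(arg_max_seq_length, tokenizer_model_max_length)

UNSET_SENTINEL = int(1e30)  # what tokenizers report when no limit is configured

print(resolve_max_seq_length(None, UNSET_SENTINEL))  # 1024 -> too long for RoBERTa
print(resolve_max_seq_length(512, UNSET_SENTINEL))   # 512 -> fits
```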
I have no idea what the content of your roberta-base folder is, but your addition is probably correct. It works with the official checkpoint, where the model specifies a max length that the script then uses; maybe that's the part missing in your local checkpoint.
Yeah, you are correct. The checkpoint that the official script downloads works. There might be something mismatched in my cached roberta-base folder (manually downloaded from AWS, probably not the newest files). Thank you for pointing this out.
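As a sanity check for anyone hitting this with a hand-downloaded checkpoint: the listing above has no tokenizer_config.json, which is where the official roberta-base pins model_max_length. A small helper (check_checkpoint is hypothetical, written for this thread) can report what length limits a local folder actually configures:

```python
import json
import os

def check_checkpoint(dirpath):
    """Report the sequence-length limits configured in a local checkpoint dir."""
    report = {"max_position_embeddings": None, "model_max_length": None}
    cfg_path = os.path.join(dirpath, "config.json")
    if os.path.exists(cfg_path):
        with open(cfg_path) as f:
            report["max_position_embeddings"] = json.load(f).get("max_position_embeddings")
    tok_path = os.path.join(dirpath, "tokenizer_config.json")
    if os.path.exists(tok_path):  # missing -> tokenizer falls back to a huge default
        with open(tok_path) as f:
            report["model_max_length"] = json.load(f).get("model_max_length")
    return report
```

A folder like the one listed above would come back with max_position_embeddings set (514 for RoBERTa) but model_max_length as None, which matches the symptom.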
Environment info
transformers version: 4.14.0.dev0
Who can help
@patrickvonplaten @LysandreJik
Information
Model I am using: roberta-base
The problem arises when using:
The tasks I am working on are:
To reproduce
Steps to reproduce the behavior:
Follow the official instructions and run python run_mlm_no_trainer.py (the exact command is given above).
Expected behavior