huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Script run_mlm_no_trainer.py error #15081

Closed cyk1337 closed 2 years ago

cyk1337 commented 2 years ago

Environment info

Who can help

@patrickvonplaten @LysandreJik

Information

Model I am using: roberta-base

The problem arises when using:

The tasks I am working on are:

To reproduce

Steps to reproduce the behavior:

Following the official instructions for python run_mlm_no_trainer.py:

python run_mlm_no_trainer.py \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --model_name_or_path roberta-base \
    --output_dir /tmp/test-mlm

Expected behavior

Training should run to completion. Instead, the run fails with the following error:

[INFO|trainer.py:1204] 2022-01-09 20:51:14,185 >> ***** Running training *****
[INFO|trainer.py:1205] 2022-01-09 20:51:14,185 >>   Num examples = 4650
[INFO|trainer.py:1206] 2022-01-09 20:51:14,185 >>   Num Epochs = 3
[INFO|trainer.py:1207] 2022-01-09 20:51:14,185 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:1208] 2022-01-09 20:51:14,186 >>   Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:1209] 2022-01-09 20:51:14,186 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1210] 2022-01-09 20:51:14,186 >>   Total optimization steps = 219
  0%|                                                                                                   | 0/219 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/xxx/.vscode-server/extensions/ms-python.python-2021.1.502429796/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/home/xxx/.vscode-server/extensions/ms-python.python-2021.1.502429796/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
    run()
  File "/home/xxx/.vscode-server/extensions/ms-python.python-2021.1.502429796/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/xxx/transformers/examples/pytorch/demo/run_mlm.py", line 556, in <module>
    main()
  File "/home/xxx/transformers/examples/pytorch/demo/run_mlm.py", line 505, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/xxx/transformers/src/transformers/trainer.py", line 1325, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/home/xxx/transformers/src/transformers/trainer.py", line 1884, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/xxx/transformers/src/transformers/trainer.py", line 1916, in compute_loss
    outputs = model(**inputs)
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xxx/transformers/src/transformers/models/roberta/modeling_roberta.py", line 1108, in forward
    return_dict=return_dict,
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xxx/transformers/src/transformers/models/roberta/modeling_roberta.py", line 819, in forward
    buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (1024) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [8, 1024].  Tensor sizes: [1, 514]
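
For reference, the failing call reduces to expanding RoBERTa's buffered token_type_ids (shape (1, 514) for roberta-base) to the batch's sequence length. A minimal sketch of the mismatch, assuming those shapes:

import torch

# roberta-base registers a token_type_ids buffer sized to its 514 position
# embeddings (assumption based on the tensor sizes in the traceback above).
buffered_token_type_ids = torch.zeros(1, 514, dtype=torch.long)

batch_size, seq_length = 8, 1024  # the sizes reported in the error above

try:
    # expand() can only broadcast singleton dimensions, so 514 cannot become 1024.
    buffered_token_type_ids.expand(batch_size, seq_length)
except RuntimeError as err:
    print(err)  # "The expanded size of the tensor (1024) must match the existing size (514) ..."
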
LysandreJik commented 2 years ago

cc @sgugger

sgugger commented 2 years ago

Which command are you running exactly? The logs you posted show distributed training, whereas the command you shared (which runs successfully on my side) launches the script with plain python.

cyk1337 commented 2 years ago

I just reran it on another machine but got the same issue.

The exact command is:

$ python run_mlm_no_trainer.py --model_name_or_path=./roberta-base --dataset_name=wikitext --dataset_config_name=wikitext-2-raw-v1 --output_dir=./test_mlm_out

where ./roberta-base directory contains:

 $ ls roberta-base/
config.json  merges.txt  pytorch_model.bin  vocab.json

The output was:

01/11/2022 11:59:36 - INFO - __main__ - ***** Running training *****
01/11/2022 11:59:36 - INFO - __main__ -   Num examples = 2390
01/11/2022 11:59:36 - INFO - __main__ -   Num Epochs = 3
01/11/2022 11:59:36 - INFO - __main__ -   Instantaneous batch size per device = 8
01/11/2022 11:59:36 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 8
01/11/2022 11:59:36 - INFO - __main__ -   Gradient Accumulation steps = 1
01/11/2022 11:59:36 - INFO - __main__ -   Total optimization steps = 897
  0%|                                                                                                                                                                                         | 0/897 [00:00<?, ?it/s]Traceback (most recent call last):
  File "run_mlm_no_trainer.py", line 566, in <module>
    main()
  File "run_mlm_no_trainer.py", line 513, in main
    outputs = model(**batch)
  File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 1106, in forward
    return_dict=return_dict,
  File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 817, in forward
    buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (1024) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [8, 1024].  Tensor sizes: [1, 514]
  0%|                                                                                                                                                                                         | 0/897 [00:00<?, ?it/s]

Possible solution

The reported error is a size mismatch in the last dimension between the target size (1024) and the existing size (514) of token_type_ids. I suspect it is caused by leaving --max_seq_length unspecified. With the additional argument --max_seq_length=512, it works. Is that correct?
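
For context, the length fallback that kicks in when --max_seq_length is omitted can be sketched as follows (pick_max_seq_length is a hypothetical helper that paraphrases the example script's logic; the real script inlines it and reads the limit from the tokenizer):

VERY_LARGE_INTEGER = int(1e30)  # roughly what a tokenizer without a recorded limit reports

def pick_max_seq_length(cli_max_seq_length, tokenizer_model_max_length):
    if cli_max_seq_length is None:
        max_seq_length = tokenizer_model_max_length
        if max_seq_length > 1024:
            # No usable limit recorded in the tokenizer: the script caps the length
            # at 1024, which exceeds roberta-base's 514 position embeddings and
            # triggers the expand() error above.
            max_seq_length = 1024
        return max_seq_length
    return min(cli_max_seq_length, tokenizer_model_max_length)

print(pick_max_seq_length(None, VERY_LARGE_INTEGER))  # 1024 -> the crash above
print(pick_max_seq_length(None, 512))                 # 512  -> the official checkpoint case
print(pick_max_seq_length(512, VERY_LARGE_INTEGER))   # 512  -> the reported fix
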

sgugger commented 2 years ago

I have no idea what the content of your roberta-base folder is, but your addition is probably correct. It works with the official checkpoint, where the model specifies a max length that the script then uses; maybe that is the part missing in your local checkpoint.
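
One way to check whether that limit is the missing piece is to compare the locally saved tokenizer with the Hub checkpoint (a sketch, assuming the paths from this thread):

from transformers import AutoTokenizer

# Paths taken from the thread: the manually downloaded folder vs. the Hub checkpoint.
local_tok = AutoTokenizer.from_pretrained("./roberta-base")
hub_tok = AutoTokenizer.from_pretrained("roberta-base")

# The Hub tokenizer records model_max_length=512; a folder that only contains
# config.json, vocab.json, and merges.txt has no such record and reports a huge
# sentinel value instead.
print("local:", local_tok.model_max_length)
print("hub:  ", hub_tok.model_max_length)
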

cyk1337 commented 2 years ago

Yeah, you are correct. The checkpoint that the official script downloads works. There might be something mismatched in my cached roberta-base folder (it was manually downloaded from AWS and is probably not the newest version). Thank you for pointing this out.