Multi-GPU fails

Environment info

transformers version: 4.6.1
Platform: Linux-4.19.0-17-cloud-amd64-x86_64-with-glibc2.10
Python version: 3.8.10
PyTorch version (GPU?): 1.8.1+cu111
Tensorflow version (GPU?): not installed
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: Data Parallel

Who can help

Models:

openai-gpt: @sgugger

Library:

trainer: @sgugger

Examples:

maintained examples (not research project or legacy): @sgugger, @patil-suraj

Information

Model I am using (Bert, XLNet ...): openai-gpt

The problem arises when using:

[X] the official example scripts: (give details below)
[ ] my own modified scripts: (give details below)

The task I am working on is:

[ ] an official GLUE/SQUaD task: (give the name)
[X] my own task or dataset: (give details below)

My dataset is a simple text file with strings for causal language modelling.

To reproduce

python run_clm.py     --model_name_or_path openai-gpt     --train_file dataset/train.txt --validation_file dataset/eval.txt     --do_train     --do_eval     --output_dir /tmp/ --method range --source fi.json --from_scratch --per_device_eval_batch_size 4 --per_device_train_batch_size 4

Error Log:

2021-07-26T14:09:12.968147055Z  sudo: setrlimit(RLIMIT_STACK): Operation not permitted
2021-07-26T14:09:14.905455906Z  07/26/2021 14:09:14 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 2distributed training: False, 16-bits training: False
2021-07-26T14:09:14.90566887Z   07/26/2021 14:09:14 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
2021-07-26T14:09:14.905680763Z  _n_gpu=2,
2021-07-26T14:09:14.905686554Z  adafactor=False,
2021-07-26T14:09:14.905691893Z  adam_beta1=0.9,
2021-07-26T14:09:14.905697154Z  adam_beta2=0.999,
2021-07-26T14:09:14.9057025Z    adam_epsilon=1e-08,
2021-07-26T14:09:14.90570797Z   dataloader_drop_last=False,
2021-07-26T14:09:14.905713094Z  dataloader_num_workers=0,
2021-07-26T14:09:14.905718126Z  dataloader_pin_memory=True,
2021-07-26T14:09:14.905723969Z  ddp_find_unused_parameters=None,
2021-07-26T14:09:14.905729253Z  debug=[],
2021-07-26T14:09:14.905734499Z  deepspeed=None,
2021-07-26T14:09:14.9057397Z    disable_tqdm=False,
2021-07-26T14:09:14.905744923Z  do_eval=True,
2021-07-26T14:09:14.905749956Z  do_predict=False,
2021-07-26T14:09:14.90575516Z   do_train=True,
2021-07-26T14:09:14.90576029Z   eval_accumulation_steps=None,
2021-07-26T14:09:14.905766046Z  eval_steps=500,
2021-07-26T14:09:14.905771809Z  evaluation_strategy=IntervalStrategy.STEPS,
2021-07-26T14:09:14.905777566Z  fp16=False,
2021-07-26T14:09:14.905782742Z  fp16_backend=auto,
2021-07-26T14:09:14.905787796Z  fp16_full_eval=False,
2021-07-26T14:09:14.90579285Z   fp16_opt_level=O1,
2021-07-26T14:09:14.90579783Z   gradient_accumulation_steps=32,
2021-07-26T14:09:14.905802916Z  greater_is_better=None,
2021-07-26T14:09:14.905808523Z  group_by_length=False,
2021-07-26T14:09:14.905813853Z  ignore_data_skip=False,
2021-07-26T14:09:14.905819176Z  label_names=None,
2021-07-26T14:09:14.905824413Z  label_smoothing_factor=0.0,
2021-07-26T14:09:14.905829632Z  learning_rate=5e-05,
2021-07-26T14:09:14.905834616Z  length_column_name=length,
2021-07-26T14:09:14.905839636Z  load_best_model_at_end=False,
2021-07-26T14:09:14.905844662Z  local_rank=-1,
2021-07-26T14:09:14.905850119Z  log_level=-1,
2021-07-26T14:09:14.905855292Z  log_level_replica=-1,
2021-07-26T14:09:14.905860668Z  log_on_each_node=True,
2021-07-26T14:09:14.905865976Z  logging_dir=result/runs/Jul26_14-09-14_cffe56d6abc4,
2021-07-26T14:09:14.905871216Z  logging_first_step=False,
2021-07-26T14:09:14.905876242Z  logging_steps=500,
2021-07-26T14:09:14.905881425Z  logging_strategy=IntervalStrategy.STEPS,
2021-07-26T14:09:14.905903565Z  lr_scheduler_type=SchedulerType.LINEAR,
2021-07-26T14:09:14.905909738Z  max_grad_norm=1.0,
2021-07-26T14:09:14.905915195Z  max_steps=50000,
2021-07-26T14:09:14.905920608Z  metric_for_best_model=None,
2021-07-26T14:09:14.905925952Z  mp_parameters=,
2021-07-26T14:09:14.905931035Z  no_cuda=False,
2021-07-26T14:09:14.905936031Z  num_train_epochs=3.0,
2021-07-26T14:09:14.905941121Z  output_dir=result,
2021-07-26T14:09:14.905946155Z  overwrite_output_dir=True,
2021-07-26T14:09:14.905951772Z  past_index=-1,
2021-07-26T14:09:14.905957084Z  per_device_eval_batch_size=16,
2021-07-26T14:09:14.905962457Z  per_device_train_batch_size=32,
2021-07-26T14:09:14.905967855Z  prediction_loss_only=False,
2021-07-26T14:09:14.905973078Z  push_to_hub=False,
2021-07-26T14:09:14.905978145Z  push_to_hub_model_id=result,
2021-07-26T14:09:14.905983324Z  push_to_hub_organization=None,
2021-07-26T14:09:14.905988388Z  push_to_hub_token=None,
2021-07-26T14:09:14.905993985Z  remove_unused_columns=True,
2021-07-26T14:09:14.905999497Z  report_to=[],
2021-07-26T14:09:14.906004944Z  resume_from_checkpoint=None,
2021-07-26T14:09:14.906010281Z  run_name=result,
2021-07-26T14:09:14.906015348Z  save_on_each_node=False,
2021-07-26T14:09:14.906020454Z  save_steps=500,
2021-07-26T14:09:14.906025527Z  save_strategy=IntervalStrategy.STEPS,
2021-07-26T14:09:14.906030714Z  save_total_limit=1,
2021-07-26T14:09:14.906036287Z  seed=42,
2021-07-26T14:09:14.90604172Z   sharded_ddp=[],
2021-07-26T14:09:14.90604725Z   skip_memory_metrics=True,
2021-07-26T14:09:14.906052407Z  tpu_metrics_debug=False,
2021-07-26T14:09:14.906057473Z  tpu_num_cores=None,
2021-07-26T14:09:14.906062617Z  use_legacy_prediction_loop=False,
2021-07-26T14:09:14.906067774Z  warmup_ratio=0.0,
2021-07-26T14:09:14.90607286Z   warmup_steps=0,
2021-07-26T14:09:14.906078463Z  weight_decay=0.0,
2021-07-26T14:09:14.906083927Z  )
2021-07-26T14:09:15.117365107Z  07/26/2021 14:09:15 - WARNING - datasets.builder - Using custom data configuration default-dfca9c6f12495150
2021-07-26T14:09:15.118233822Z  07/26/2021 14:09:15 - INFO - datasets.utils.filelock - Lock 139871027286176 acquired on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-dfca9c6f12495150_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
2021-07-26T14:09:15.118379685Z  07/26/2021 14:09:15 - INFO - datasets.utils.filelock - Lock 139871027286176 released on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-dfca9c6f12495150_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
2021-07-26T14:09:15.118514014Z  07/26/2021 14:09:15 - INFO - datasets.utils.filelock - Lock 139866173991472 acquired on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-dfca9c6f12495150_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
2021-07-26T14:09:15.118567887Z  07/26/2021 14:09:15 - INFO - datasets.builder - Generating dataset text (/root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
2021-07-26T14:09:15.12032563Z   Downloading and preparing dataset text/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5...
2021-07-26T14:09:15.120337297Z  07/26/2021 14:09:15 - INFO - datasets.utils.download_manager - Downloading took 0.0 min
2021-07-26T14:09:15.121994254Z  07/26/2021 14:09:15 - INFO - datasets.utils.download_manager - Checksum Computation took 0.0 min
2021-07-26T14:09:15.122429438Z  
     0%|          | 0/2 [00:00<?, ?it/s]
100%|██████████|    2/2 [00:00<00:00, 5761.41it/s]
2021-07-26T14:09:15.124508599Z  07/26/2021 14:09:15 - INFO - datasets.utils.info_utils - Unable to verify checksums.
2021-07-26T14:09:15.124597847Z  07/26/2021 14:09:15 - INFO - datasets.builder - Generating split train
2021-07-26T14:09:15.125310516Z  
     0%|          | 0/2 [00:00<?, ?it/s]
100%|██████████|    2/2 [00:00<00:00, 1147.55it/s]
2021-07-26T14:09:15.128544997Z  07/26/2021 14:09:15 - INFO - datasets.arrow_writer - Done writing 2000 examples in 164067 bytes /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete/text-train.arrow.
2021-07-26T14:09:15.128626548Z  07/26/2021 14:09:15 - INFO - datasets.builder - Generating split validation
2021-07-26T14:09:15.12993743Z   07/26/2021 14:09:15 - INFO - datasets.arrow_writer - Done writing 1000 examples in 90150 bytes /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete/text-validation.arrow.
2021-07-26T14:09:15.130003546Z  07/26/2021 14:09:15 - INFO - datasets.utils.info_utils - Unable to verify splits sizes.
2021-07-26T14:09:15.130088692Z  07/26/2021 14:09:15 - INFO - datasets.utils.filelock - Lock 139866173989600 acquired on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-dfca9c6f12495150_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete.lock
2021-07-26T14:09:15.130360478Z  07/26/2021 14:09:15 - INFO - datasets.utils.filelock - Lock 139866173989600 released on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-dfca9c6f12495150_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete.lock
2021-07-26T14:09:15.130449829Z  Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5. Subsequent calls will reuse this data.
2021-07-26T14:09:15.130456275Z  07/26/2021 14:09:15 - INFO - datasets.utils.filelock - Lock 139866173991472 released on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-dfca9c6f12495150_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
2021-07-26T14:09:15.130475953Z  07/26/2021 14:09:15 - INFO - datasets.builder - Constructing Dataset for split train, validation, from /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5
2021-07-26T14:09:15.314137303Z  
0   tables [00:00, ? tables/s]

0   tables [00:00, ? tables/s]

     0%|          | 0/2 [00:00<?, ?it/s]
100%|██████████|    2/2 [00:00<00:00, 655.77it/s]
2021-07-26T14:09:15.31416541Z   [INFO|file_utils.py:1624] 2021-07-26 14:09:15,313 >> https://huggingface.co/openai-gpt/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpd5znm5l1
2021-07-26T14:09:15.496180381Z  
Downloading:      0%|          | 0.00/656 [00:00<?, ?B/s]
Downloading:    100%|██████████| 656/656 [00:00<00:00, 433kB/s]
2021-07-26T14:09:15.496209117Z  [INFO|file_utils.py:1628] 2021-07-26 14:09:15,496 >> storing https://huggingface.co/openai-gpt/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/bebb46f5735701bc248ef9faa26f12577944fa7fc8e9be1a774b94d4cb8b79b6.ba6f10a5446f364b92311c09e55e49aa27024a4aeefc1ea50fd733b77bcd997d
2021-07-26T14:09:15.496286347Z  [INFO|file_utils.py:1636] 2021-07-26 14:09:15,496 >> creating metadata file for /root/.cache/huggingface/transformers/bebb46f5735701bc248ef9faa26f12577944fa7fc8e9be1a774b94d4cb8b79b6.ba6f10a5446f364b92311c09e55e49aa27024a4aeefc1ea50fd733b77bcd997d
2021-07-26T14:09:15.496582551Z  [INFO|configuration_utils.py:545] 2021-07-26 14:09:15,496 >> loading configuration file https://huggingface.co/openai-gpt/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/bebb46f5735701bc248ef9faa26f12577944fa7fc8e9be1a774b94d4cb8b79b6.ba6f10a5446f364b92311c09e55e49aa27024a4aeefc1ea50fd733b77bcd997d
2021-07-26T14:09:15.497318074Z  [INFO|configuration_utils.py:581] 2021-07-26 14:09:15,497 >> Model config OpenAIGPTConfig {
2021-07-26T14:09:15.497326601Z    "afn": "gelu",
2021-07-26T14:09:15.497332651Z    "architectures": [
2021-07-26T14:09:15.497338467Z      "OpenAIGPTLMHeadModel"
2021-07-26T14:09:15.49734389Z     ],
2021-07-26T14:09:15.497349194Z    "attn_pdrop": 0.1,
2021-07-26T14:09:15.497354591Z    "embd_pdrop": 0.1,
2021-07-26T14:09:15.497360424Z    "initializer_range": 0.02,
2021-07-26T14:09:15.497366131Z    "layer_norm_epsilon": 1e-05,
2021-07-26T14:09:15.4973717Z      "model_type": "openai-gpt",
2021-07-26T14:09:15.49737771Z     "n_ctx": 512,
2021-07-26T14:09:15.49738331Z     "n_embd": 768,
2021-07-26T14:09:15.497388484Z    "n_head": 12,
2021-07-26T14:09:15.497393747Z    "n_layer": 12,
2021-07-26T14:09:15.497399167Z    "n_positions": 512,
2021-07-26T14:09:15.497404934Z    "n_special": 0,
2021-07-26T14:09:15.497410553Z    "predict_special_tokens": true,
2021-07-26T14:09:15.497416327Z    "resid_pdrop": 0.1,
2021-07-26T14:09:15.497434673Z    "summary_activation": null,
2021-07-26T14:09:15.497440436Z    "summary_first_dropout": 0.1,
2021-07-26T14:09:15.497446023Z    "summary_proj_to_labels": true,
2021-07-26T14:09:15.497451297Z    "summary_type": "cls_index",
2021-07-26T14:09:15.497456789Z    "summary_use_proj": true,
2021-07-26T14:09:15.49746268Z     "task_specific_params": {
2021-07-26T14:09:15.497468433Z      "text-generation": {
2021-07-26T14:09:15.497474113Z        "do_sample": true,
2021-07-26T14:09:15.497479797Z        "max_length": 50
2021-07-26T14:09:15.497485073Z      }
2021-07-26T14:09:15.49749015Z     },
2021-07-26T14:09:15.497495326Z    "transformers_version": "4.9.0",
2021-07-26T14:09:15.497500982Z    "vocab_size": 40478
2021-07-26T14:09:15.497506886Z  }
2021-07-26T14:09:15.497512492Z  
2021-07-26T14:09:15.675411198Z  [INFO|tokenization_auto.py:432] 2021-07-26 14:09:15,674 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
2021-07-26T14:09:15.851918363Z  [INFO|configuration_utils.py:545] 2021-07-26 14:09:15,851 >> loading configuration file https://huggingface.co/openai-gpt/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/bebb46f5735701bc248ef9faa26f12577944fa7fc8e9be1a774b94d4cb8b79b6.ba6f10a5446f364b92311c09e55e49aa27024a4aeefc1ea50fd733b77bcd997d
2021-07-26T14:09:15.852684702Z  [INFO|configuration_utils.py:581] 2021-07-26 14:09:15,852 >> Model config OpenAIGPTConfig {
2021-07-26T14:09:15.852691992Z    "afn": "gelu",
2021-07-26T14:09:15.85269584Z     "architectures": [
2021-07-26T14:09:15.852699315Z      "OpenAIGPTLMHeadModel"
2021-07-26T14:09:15.852702686Z    ],
2021-07-26T14:09:15.852706345Z    "attn_pdrop": 0.1,
2021-07-26T14:09:15.852709633Z    "embd_pdrop": 0.1,
2021-07-26T14:09:15.852712825Z    "initializer_range": 0.02,
2021-07-26T14:09:15.852716035Z    "layer_norm_epsilon": 1e-05,
2021-07-26T14:09:15.852719184Z    "model_type": "openai-gpt",
2021-07-26T14:09:15.852722288Z    "n_ctx": 512,
2021-07-26T14:09:15.852725375Z    "n_embd": 768,
2021-07-26T14:09:15.852728435Z    "n_head": 12,
2021-07-26T14:09:15.852731725Z    "n_layer": 12,
2021-07-26T14:09:15.852734975Z    "n_positions": 512,
2021-07-26T14:09:15.852738185Z    "n_special": 0,
2021-07-26T14:09:15.852741425Z    "predict_special_tokens": true,
2021-07-26T14:09:15.852744547Z    "resid_pdrop": 0.1,
2021-07-26T14:09:15.85274759Z     "summary_activation": null,
2021-07-26T14:09:15.852750587Z    "summary_first_dropout": 0.1,
2021-07-26T14:09:15.852753673Z    "summary_proj_to_labels": true,
2021-07-26T14:09:15.852769472Z    "summary_type": "cls_index",
2021-07-26T14:09:15.852772952Z    "summary_use_proj": true,
2021-07-26T14:09:15.852776136Z    "task_specific_params": {
2021-07-26T14:09:15.852779304Z      "text-generation": {
2021-07-26T14:09:15.852782414Z        "do_sample": true,
2021-07-26T14:09:15.852785664Z        "max_length": 50
2021-07-26T14:09:15.852788824Z      }
2021-07-26T14:09:15.852791737Z    },
2021-07-26T14:09:15.852795052Z    "transformers_version": "4.9.0",
2021-07-26T14:09:15.852798497Z    "vocab_size": 40478
2021-07-26T14:09:15.85280183Z   }
2021-07-26T14:09:15.852805286Z  
2021-07-26T14:09:16.215260602Z  [INFO|file_utils.py:1624] 2021-07-26 14:09:16,215 >> https://huggingface.co/openai-gpt/resolve/main/vocab.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp5ct5tg0n
2021-07-26T14:09:16.457642584Z  
Downloading:      0%|          | 0.00/816k [00:00<?, ?B/s]
Downloading:    100%|██████████| 816k/816k [00:00<00:00, 14.9MB/s]
2021-07-26T14:09:16.457666203Z  [INFO|file_utils.py:1628] 2021-07-26 14:09:16,457 >> storing https://huggingface.co/openai-gpt/resolve/main/vocab.json in cache at /root/.cache/huggingface/transformers/918c57540c636a2a662770d208fcf20aa8c3faea78201fc612e5c84f052f1119.ac55819e76b0f8b0c32cbb407436947d090d98f8952f38376ee249ed382927ab
2021-07-26T14:09:16.457749557Z  [INFO|file_utils.py:1636] 2021-07-26 14:09:16,457 >> creating metadata file for /root/.cache/huggingface/transformers/918c57540c636a2a662770d208fcf20aa8c3faea78201fc612e5c84f052f1119.ac55819e76b0f8b0c32cbb407436947d090d98f8952f38376ee249ed382927ab
2021-07-26T14:09:16.642597998Z  [INFO|file_utils.py:1624] 2021-07-26 14:09:16,642 >> https://huggingface.co/openai-gpt/resolve/main/merges.txt not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp2_1m78tv
2021-07-26T14:09:16.874544236Z  
Downloading:      0%|          | 0.00/458k [00:00<?, ?B/s]
Downloading:    100%|██████████| 458k/458k [00:00<00:00, 10.9MB/s]
2021-07-26T14:09:16.874569317Z  [INFO|file_utils.py:1628] 2021-07-26 14:09:16,874 >> storing https://huggingface.co/openai-gpt/resolve/main/merges.txt in cache at /root/.cache/huggingface/transformers/a682e219a788dde0e4f77bc5a470d85a4d7e493420506ce7e3266f7be122cf9e.2150b9689fda7ca7c6224ff32672c004259f974e96934e8eb69d8dd546d682db
2021-07-26T14:09:16.87473933Z   [INFO|file_utils.py:1636] 2021-07-26 14:09:16,874 >> creating metadata file for /root/.cache/huggingface/transformers/a682e219a788dde0e4f77bc5a470d85a4d7e493420506ce7e3266f7be122cf9e.2150b9689fda7ca7c6224ff32672c004259f974e96934e8eb69d8dd546d682db
2021-07-26T14:09:17.0542553Z    [INFO|file_utils.py:1624] 2021-07-26 14:09:17,054 >> https://huggingface.co/openai-gpt/resolve/main/tokenizer.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpiqlissjs
2021-07-26T14:09:17.308757452Z  
Downloading:      0%|          | 0.00/1.27M [00:00<?, ?B/s]
Downloading:    100%|██████████| 1.27M/1.27M [00:00<00:00, 19.6MB/s]
2021-07-26T14:09:17.308790611Z  [INFO|file_utils.py:1628] 2021-07-26 14:09:17,308 >> storing https://huggingface.co/openai-gpt/resolve/main/tokenizer.json in cache at /root/.cache/huggingface/transformers/325373fcbb0daa99905371727842a87ae9ca0f02f71db071720bb4d5a59076cf.b1810f3c6ed9fc0632664008484a9b569103559c04ac90321723cd808a3a96f9
2021-07-26T14:09:17.308827786Z  [INFO|file_utils.py:1636] 2021-07-26 14:09:17,308 >> creating metadata file for /root/.cache/huggingface/transformers/325373fcbb0daa99905371727842a87ae9ca0f02f71db071720bb4d5a59076cf.b1810f3c6ed9fc0632664008484a9b569103559c04ac90321723cd808a3a96f9
2021-07-26T14:09:17.838142571Z  [INFO|tokenization_utils_base.py:1730] 2021-07-26 14:09:17,837 >> loading file https://huggingface.co/openai-gpt/resolve/main/vocab.json from cache at /root/.cache/huggingface/transformers/918c57540c636a2a662770d208fcf20aa8c3faea78201fc612e5c84f052f1119.ac55819e76b0f8b0c32cbb407436947d090d98f8952f38376ee249ed382927ab
2021-07-26T14:09:17.838167038Z  [INFO|tokenization_utils_base.py:1730] 2021-07-26 14:09:17,837 >> loading file https://huggingface.co/openai-gpt/resolve/main/merges.txt from cache at /root/.cache/huggingface/transformers/a682e219a788dde0e4f77bc5a470d85a4d7e493420506ce7e3266f7be122cf9e.2150b9689fda7ca7c6224ff32672c004259f974e96934e8eb69d8dd546d682db
2021-07-26T14:09:17.838171311Z  [INFO|tokenization_utils_base.py:1730] 2021-07-26 14:09:17,837 >> loading file https://huggingface.co/openai-gpt/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/325373fcbb0daa99905371727842a87ae9ca0f02f71db071720bb4d5a59076cf.b1810f3c6ed9fc0632664008484a9b569103559c04ac90321723cd808a3a96f9
2021-07-26T14:09:17.838174874Z  [INFO|tokenization_utils_base.py:1730] 2021-07-26 14:09:17,838 >> loading file https://huggingface.co/openai-gpt/resolve/main/added_tokens.json from cache at None
2021-07-26T14:09:17.838177733Z  [INFO|tokenization_utils_base.py:1730] 2021-07-26 14:09:17,838 >> loading file https://huggingface.co/openai-gpt/resolve/main/special_tokens_map.json from cache at None
2021-07-26T14:09:17.83818803Z   [INFO|tokenization_utils_base.py:1730] 2021-07-26 14:09:17,838 >> loading file https://huggingface.co/openai-gpt/resolve/main/tokenizer_config.json from cache at None
2021-07-26T14:09:18.023973304Z  [INFO|configuration_utils.py:545] 2021-07-26 14:09:18,023 >> loading configuration file https://huggingface.co/openai-gpt/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/bebb46f5735701bc248ef9faa26f12577944fa7fc8e9be1a774b94d4cb8b79b6.ba6f10a5446f364b92311c09e55e49aa27024a4aeefc1ea50fd733b77bcd997d
2021-07-26T14:09:18.025605412Z  [INFO|configuration_utils.py:581] 2021-07-26 14:09:18,025 >> Model config OpenAIGPTConfig {
2021-07-26T14:09:18.025632076Z    "afn": "gelu",
2021-07-26T14:09:18.025638821Z    "architectures": [
2021-07-26T14:09:18.025644803Z      "OpenAIGPTLMHeadModel"
2021-07-26T14:09:18.02565048Z     ],
2021-07-26T14:09:18.025655907Z    "attn_pdrop": 0.1,
2021-07-26T14:09:18.025659711Z    "embd_pdrop": 0.1,
2021-07-26T14:09:18.025663648Z    "initializer_range": 0.02,
2021-07-26T14:09:18.02566734Z     "layer_norm_epsilon": 1e-05,
2021-07-26T14:09:18.025671169Z    "model_type": "openai-gpt",
2021-07-26T14:09:18.025686901Z    "n_ctx": 512,
2021-07-26T14:09:18.025690748Z    "n_embd": 768,
2021-07-26T14:09:18.025694256Z    "n_head": 12,
2021-07-26T14:09:18.025697812Z    "n_layer": 12,
2021-07-26T14:09:18.025701325Z    "n_positions": 512,
2021-07-26T14:09:18.025705268Z    "n_special": 0,
2021-07-26T14:09:18.025709002Z    "predict_special_tokens": true,
2021-07-26T14:09:18.025712833Z    "resid_pdrop": 0.1,
2021-07-26T14:09:18.025716428Z    "summary_activation": null,
2021-07-26T14:09:18.025721606Z    "summary_first_dropout": 0.1,
2021-07-26T14:09:18.025727781Z    "summary_proj_to_labels": true,
2021-07-26T14:09:18.025732321Z    "summary_type": "cls_index",
2021-07-26T14:09:18.025735991Z    "summary_use_proj": true,
2021-07-26T14:09:18.025739869Z    "task_specific_params": {
2021-07-26T14:09:18.025743781Z      "text-generation": {
2021-07-26T14:09:18.025747651Z        "do_sample": true,
2021-07-26T14:09:18.025751454Z        "max_length": 50
2021-07-26T14:09:18.025755031Z      }
2021-07-26T14:09:18.025758401Z    },
2021-07-26T14:09:18.025761928Z    "transformers_version": "4.9.0",
2021-07-26T14:09:18.025765657Z    "vocab_size": 40478
2021-07-26T14:09:18.025769586Z  }
2021-07-26T14:09:18.02577327Z   
2021-07-26T14:09:23.021111594Z  07/26/2021 14:09:23 - INFO - __main__ - Training new model from scratch - Total size=111.14M params
2021-07-26T14:09:23.070773083Z  07/26/2021 14:09:23 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/cache-8e82676f86a14c2c.arrow
2021-07-26T14:09:23.094906386Z  07/26/2021 14:09:23 - INFO - datasets.arrow_writer - Done writing 2000 examples in 207498 bytes /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/tmpbehl1qz0.
2021-07-26T14:09:23.117860452Z  
Running tokenizer on dataset:   0%|          | 0/2 [00:00<?, ?ba/s]
Running tokenizer on dataset: 100%|██████████| 2/2 [00:00<00:00, 43.33ba/s]
2021-07-26T14:09:23.133773375Z  07/26/2021 14:09:23 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/cache-35b2963f79b3b422.arrow
2021-07-26T14:09:23.139336489Z  07/26/2021 14:09:23 - INFO - datasets.arrow_writer - Done writing 1000 examples in 113806 bytes /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/tmp9n9hycnj.
2021-07-26T14:09:23.144312664Z  
Running tokenizer on dataset:   0%|          | 0/1 [00:00<?, ?ba/s]
Running tokenizer on dataset: 100%|██████████| 1/1 [00:00<00:00, 46.94ba/s]
2021-07-26T14:09:23.235184764Z  07/26/2021 14:09:23 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/cache-f0614aafe173fe5c.arrow
2021-07-26T14:09:23.340753289Z  07/26/2021 14:09:23 - INFO - datasets.arrow_writer - Done writing 72 examples in 480120 bytes /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/tmpbjayy6wf.
2021-07-26T14:09:23.344673188Z  
Grouping    texts in chunks of 512:   0%|          | 0/2 [00:00<?, ?ba/s]
Grouping    texts in chunks of 512: 100%|██████████| 2/2 [00:00<00:00, 10.21ba/s]
Grouping    texts in chunks of 512: 100%|██████████| 2/2 [00:00<00:00, 10.20ba/s]
2021-07-26T14:09:23.449866442Z  07/26/2021 14:09:23 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/cache-9636fc49daf5222e.arrow
2021-07-26T14:09:23.454281769Z  07/26/2021 14:09:23 - INFO - datasets.arrow_writer - Done writing 39 examples in 260064 bytes /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/tmpz8sa4yn6.
2021-07-26T14:09:23.482471097Z  07/26/2021 14:09:23 - INFO - datasets.arrow_writer - Done writing 200000 indices in 320000000 bytes .
2021-07-26T14:09:23.485361448Z  07/26/2021 14:09:23 - INFO - datasets.arrow_writer - Done writing 7000 indices in 392000 bytes .
2021-07-26T14:09:25.751105446Z  
Grouping    texts in chunks of 512:   0%|          | 0/1 [00:00<?, ?ba/s]
Grouping    texts in chunks of 512: 100%|██████████| 1/1 [00:00<00:00,  9.15ba/s]
Grouping    texts in chunks of 512: 100%|██████████| 1/1 [00:00<00:00,  9.13ba/s]
2021-07-26T14:09:25.751141123Z  [INFO|trainer.py:404] 2021-07-26 14:09:25,750 >> max_steps is given, it will override any value given in num_train_epochs
2021-07-26T14:09:25.757944575Z  [INFO|trainer.py:1164] 2021-07-26 14:09:25,757 >> ***** Running training *****
2021-07-26T14:09:25.757972847Z  [INFO|trainer.py:1165] 2021-07-26 14:09:25,757 >>   Num examples = 200000
2021-07-26T14:09:25.757978165Z  [INFO|trainer.py:1166] 2021-07-26 14:09:25,757 >>   Num Epochs = 516
2021-07-26T14:09:25.757982299Z  [INFO|trainer.py:1167] 2021-07-26 14:09:25,757 >>   Instantaneous batch size per device = 32
2021-07-26T14:09:25.757986728Z  [INFO|trainer.py:1168] 2021-07-26 14:09:25,757 >>   Total train batch size (w. parallel, distributed & accumulation) = 2048
2021-07-26T14:09:25.757990875Z  [INFO|trainer.py:1169] 2021-07-26 14:09:25,757 >>   Gradient Accumulation steps = 32
2021-07-26T14:09:25.757994803Z  [INFO|trainer.py:1170] 2021-07-26 14:09:25,757 >>   Total optimization steps = 50000
2021-07-26T14:09:27.841919702Z  
     0%|          | 0/50000 [00:00<?, ?it/s]
Traceback (most recent call last):
2021-07-26T14:09:27.841956297Z    File "run_clm.py", line 572, in <module>
2021-07-26T14:09:27.841963933Z      main()
2021-07-26T14:09:27.841969132Z    File "run_clm.py", line 522, in main
2021-07-26T14:09:27.841991003Z      train_result = trainer.train(resume_from_checkpoint=checkpoint)
2021-07-26T14:09:27.841996801Z    File "/home/user/miniconda/lib/python3.8/site-packages/transformers/trainer.py", line 1280, in train
2021-07-26T14:09:27.842002482Z      tr_loss += self.training_step(model, inputs)
2021-07-26T14:09:27.842007478Z    File "/home/user/miniconda/lib/python3.8/site-packages/transformers/trainer.py", line 1773, in training_step
2021-07-26T14:09:27.842012807Z      loss = self.compute_loss(model, inputs)
2021-07-26T14:09:27.842017737Z    File "/home/user/miniconda/lib/python3.8/site-packages/transformers/trainer.py", line 1805, in compute_loss
2021-07-26T14:09:27.84202311Z       outputs = model(**inputs)
2021-07-26T14:09:27.842028183Z    File "/home/user/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
2021-07-26T14:09:27.842034154Z      result = self.forward(*input, **kwargs)
2021-07-26T14:09:27.842039413Z    File "/home/user/miniconda/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
2021-07-26T14:09:27.842045122Z      outputs = self.parallel_apply(replicas, inputs, kwargs)
2021-07-26T14:09:27.84205038Z     File "/home/user/miniconda/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
2021-07-26T14:09:27.842055852Z      return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
2021-07-26T14:09:27.842061165Z    File "/home/user/miniconda/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
2021-07-26T14:09:27.842066725Z      output.reraise()
2021-07-26T14:09:27.842071565Z    File "/home/user/miniconda/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
2021-07-26T14:09:27.842077398Z      raise self.exc_type(msg)
2021-07-26T14:09:27.842082546Z  StopIteration: Caught StopIteration in replica 0 on device 0.
2021-07-26T14:09:27.842087891Z  Original Traceback (most recent call last):
2021-07-26T14:09:27.842093056Z    File "/home/user/miniconda/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
2021-07-26T14:09:27.842098477Z      output = module(*input, **kwargs)
2021-07-26T14:09:27.84210327Z     File "/home/user/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
2021-07-26T14:09:27.842108627Z      result = self.forward(*input, **kwargs)
2021-07-26T14:09:27.842113465Z    File "/home/user/miniconda/lib/python3.8/site-packages/transformers/models/openai/modeling_openai.py", line 581, in forward
2021-07-26T14:09:27.842119416Z      transformer_outputs = self.transformer(
2021-07-26T14:09:27.8421263Z      File "/home/user/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
2021-07-26T14:09:27.842132244Z      result = self.forward(*input, **kwargs)
2021-07-26T14:09:27.842137575Z    File "/home/user/miniconda/lib/python3.8/site-packages/transformers/models/openai/modeling_openai.py", line 487, in forward
2021-07-26T14:09:27.842147909Z      attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
2021-07-26T14:09:27.842153517Z  StopIteration
2021-07-26T14:09:27.842158291Z  
2021-07-26T14:09:28.598937Z 
     0%|          | 0/50000 [00:02<?, ?it/s]
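
As a reading aid for the "Running training" lines in the log above, here is a small sketch that recomputes the reported totals from the logged arguments. The variable names are illustrative, and the rounding rules (ceil for batches per epoch, floor for optimizer updates per epoch) are my assumption about how the Trainer counts steps; they do reproduce the numbers in the log.

```python
import math

# Values copied from the TrainingArguments dump and the "Running training" lines.
per_device_train_batch_size = 32
n_gpu = 2
gradient_accumulation_steps = 32
num_examples = 200_000
max_steps = 50_000

# Under DataParallel a single process feeds both GPUs, so each forward pass
# sees per_device_train_batch_size * n_gpu examples.
batch_per_forward = per_device_train_batch_size * n_gpu

# Examples consumed per optimizer update.
total_train_batch_size = batch_per_forward * gradient_accumulation_steps

# Assumed rounding: ceil for dataloader length, floor for updates per epoch.
batches_per_epoch = math.ceil(num_examples / batch_per_forward)
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
num_epochs = math.ceil(max_steps / updates_per_epoch)

print(total_train_batch_size, updates_per_epoch, num_epochs)  # 2048 97 516
```

So the logged "Total train batch size = 2048" and "Num Epochs = 516" are consistent with n_gpu=2 being picked up, and the crash happens on the very first forward pass rather than because of a misconfigured batch size.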

Expected behavior

The same as run_clm.py with a single GPU.
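
For context on the traceback: the failing line is `next(self.parameters())` in modeling_openai.py, and it is reached through torch/nn/parallel/data_parallel.py. A plausible reading, which I have not confirmed, is that the per-GPU replica exposes no parameters through `parameters()`, so `next()` has nothing to return. The sketch below reproduces only that Python mechanism without torch; `FakeReplica`, `first_dtype_unsafe`, and `first_dtype_safe` are hypothetical names for illustration, not library code.

```python
class FakeReplica:
    """Hypothetical stand-in for a module replica; dtypes are plain strings here."""

    def __init__(self, param_dtypes):
        self._param_dtypes = list(param_dtypes)

    def parameters(self):
        # Returns a generator, as torch.nn.Module.parameters() does.
        yield from self._param_dtypes


def first_dtype_unsafe(module):
    # Same pattern as the failing line: raises StopIteration if nothing is yielded.
    return next(module.parameters())


def first_dtype_safe(module, default="float32"):
    # next() with a default value returns the default instead of raising.
    return next(module.parameters(), default)


single_gpu_model = FakeReplica(["float32"])  # parameters visible, as on one GPU
empty_replica = FakeReplica([])              # assumed state of the replica

print(first_dtype_unsafe(single_gpu_model))  # float32
print(first_dtype_safe(empty_replica))       # float32 (the fallback)

try:
    first_dtype_unsafe(empty_replica)
except StopIteration:
    print("StopIteration, as in the traceback")
```

This would also explain why the single-GPU run works: the model is not wrapped in DataParallel there, so `parameters()` yields at least one tensor.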

huggingface / transformers

Multi-GPU fails #12890
