huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Multi-GPU fails #12890

Closed avi-otterai closed 3 years ago

avi-otterai commented 3 years ago

Information

Model I am using (Bert, XLNet ...): openai-gpt

The task I am working on is causal language modelling; my dataset is a simple text file of strings.

To reproduce

python run_clm.py --model_name_or_path openai-gpt --train_file dataset/train.txt --validation_file dataset/eval.txt --do_train --do_eval --output_dir /tmp/ --method range --source fi.json --from_scratch --per_device_eval_batch_size 4 --per_device_train_batch_size 4

Error Log:

2021-07-26T14:09:12.968147055Z  sudo: setrlimit(RLIMIT_STACK): Operation not permitted
2021-07-26T14:09:14.905455906Z  07/26/2021 14:09:14 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 2distributed training: False, 16-bits training: False
2021-07-26T14:09:14.90566887Z   07/26/2021 14:09:14 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
2021-07-26T14:09:14.905680763Z  _n_gpu=2,
2021-07-26T14:09:14.905686554Z  adafactor=False,
2021-07-26T14:09:14.905691893Z  adam_beta1=0.9,
2021-07-26T14:09:14.905697154Z  adam_beta2=0.999,
2021-07-26T14:09:14.9057025Z    adam_epsilon=1e-08,
2021-07-26T14:09:14.90570797Z   dataloader_drop_last=False,
2021-07-26T14:09:14.905713094Z  dataloader_num_workers=0,
2021-07-26T14:09:14.905718126Z  dataloader_pin_memory=True,
2021-07-26T14:09:14.905723969Z  ddp_find_unused_parameters=None,
2021-07-26T14:09:14.905729253Z  debug=[],
2021-07-26T14:09:14.905734499Z  deepspeed=None,
2021-07-26T14:09:14.9057397Z    disable_tqdm=False,
2021-07-26T14:09:14.905744923Z  do_eval=True,
2021-07-26T14:09:14.905749956Z  do_predict=False,
2021-07-26T14:09:14.90575516Z   do_train=True,
2021-07-26T14:09:14.90576029Z   eval_accumulation_steps=None,
2021-07-26T14:09:14.905766046Z  eval_steps=500,
2021-07-26T14:09:14.905771809Z  evaluation_strategy=IntervalStrategy.STEPS,
2021-07-26T14:09:14.905777566Z  fp16=False,
2021-07-26T14:09:14.905782742Z  fp16_backend=auto,
2021-07-26T14:09:14.905787796Z  fp16_full_eval=False,
2021-07-26T14:09:14.90579285Z   fp16_opt_level=O1,
2021-07-26T14:09:14.90579783Z   gradient_accumulation_steps=32,
2021-07-26T14:09:14.905802916Z  greater_is_better=None,
2021-07-26T14:09:14.905808523Z  group_by_length=False,
2021-07-26T14:09:14.905813853Z  ignore_data_skip=False,
2021-07-26T14:09:14.905819176Z  label_names=None,
2021-07-26T14:09:14.905824413Z  label_smoothing_factor=0.0,
2021-07-26T14:09:14.905829632Z  learning_rate=5e-05,
2021-07-26T14:09:14.905834616Z  length_column_name=length,
2021-07-26T14:09:14.905839636Z  load_best_model_at_end=False,
2021-07-26T14:09:14.905844662Z  local_rank=-1,
2021-07-26T14:09:14.905850119Z  log_level=-1,
2021-07-26T14:09:14.905855292Z  log_level_replica=-1,
2021-07-26T14:09:14.905860668Z  log_on_each_node=True,
2021-07-26T14:09:14.905865976Z  logging_dir=result/runs/Jul26_14-09-14_cffe56d6abc4,
2021-07-26T14:09:14.905871216Z  logging_first_step=False,
2021-07-26T14:09:14.905876242Z  logging_steps=500,
2021-07-26T14:09:14.905881425Z  logging_strategy=IntervalStrategy.STEPS,
2021-07-26T14:09:14.905903565Z  lr_scheduler_type=SchedulerType.LINEAR,
2021-07-26T14:09:14.905909738Z  max_grad_norm=1.0,
2021-07-26T14:09:14.905915195Z  max_steps=50000,
2021-07-26T14:09:14.905920608Z  metric_for_best_model=None,
2021-07-26T14:09:14.905925952Z  mp_parameters=,
2021-07-26T14:09:14.905931035Z  no_cuda=False,
2021-07-26T14:09:14.905936031Z  num_train_epochs=3.0,
2021-07-26T14:09:14.905941121Z  output_dir=result,
2021-07-26T14:09:14.905946155Z  overwrite_output_dir=True,
2021-07-26T14:09:14.905951772Z  past_index=-1,
2021-07-26T14:09:14.905957084Z  per_device_eval_batch_size=16,
2021-07-26T14:09:14.905962457Z  per_device_train_batch_size=32,
2021-07-26T14:09:14.905967855Z  prediction_loss_only=False,
2021-07-26T14:09:14.905973078Z  push_to_hub=False,
2021-07-26T14:09:14.905978145Z  push_to_hub_model_id=result,
2021-07-26T14:09:14.905983324Z  push_to_hub_organization=None,
2021-07-26T14:09:14.905988388Z  push_to_hub_token=None,
2021-07-26T14:09:14.905993985Z  remove_unused_columns=True,
2021-07-26T14:09:14.905999497Z  report_to=[],
2021-07-26T14:09:14.906004944Z  resume_from_checkpoint=None,
2021-07-26T14:09:14.906010281Z  run_name=result,
2021-07-26T14:09:14.906015348Z  save_on_each_node=False,
2021-07-26T14:09:14.906020454Z  save_steps=500,
2021-07-26T14:09:14.906025527Z  save_strategy=IntervalStrategy.STEPS,
2021-07-26T14:09:14.906030714Z  save_total_limit=1,
2021-07-26T14:09:14.906036287Z  seed=42,
2021-07-26T14:09:14.90604172Z   sharded_ddp=[],
2021-07-26T14:09:14.90604725Z   skip_memory_metrics=True,
2021-07-26T14:09:14.906052407Z  tpu_metrics_debug=False,
2021-07-26T14:09:14.906057473Z  tpu_num_cores=None,
2021-07-26T14:09:14.906062617Z  use_legacy_prediction_loop=False,
2021-07-26T14:09:14.906067774Z  warmup_ratio=0.0,
2021-07-26T14:09:14.90607286Z   warmup_steps=0,
2021-07-26T14:09:14.906078463Z  weight_decay=0.0,
2021-07-26T14:09:14.906083927Z  )
2021-07-26T14:09:15.117365107Z  07/26/2021 14:09:15 - WARNING - datasets.builder - Using custom data configuration default-dfca9c6f12495150
2021-07-26T14:09:15.118233822Z  07/26/2021 14:09:15 - INFO - datasets.utils.filelock - Lock 139871027286176 acquired on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-dfca9c6f12495150_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
2021-07-26T14:09:15.118379685Z  07/26/2021 14:09:15 - INFO - datasets.utils.filelock - Lock 139871027286176 released on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-dfca9c6f12495150_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
2021-07-26T14:09:15.118514014Z  07/26/2021 14:09:15 - INFO - datasets.utils.filelock - Lock 139866173991472 acquired on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-dfca9c6f12495150_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
2021-07-26T14:09:15.118567887Z  07/26/2021 14:09:15 - INFO - datasets.builder - Generating dataset text (/root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
2021-07-26T14:09:15.12032563Z   Downloading and preparing dataset text/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5...
2021-07-26T14:09:15.120337297Z  07/26/2021 14:09:15 - INFO - datasets.utils.download_manager - Downloading took 0.0 min
2021-07-26T14:09:15.121994254Z  07/26/2021 14:09:15 - INFO - datasets.utils.download_manager - Checksum Computation took 0.0 min
2021-07-26T14:09:15.122429438Z  
     0%|          | 0/2 [00:00<?, ?it/s]
100%|██████████|    2/2 [00:00<00:00, 5761.41it/s]
2021-07-26T14:09:15.124508599Z  07/26/2021 14:09:15 - INFO - datasets.utils.info_utils - Unable to verify checksums.
2021-07-26T14:09:15.124597847Z  07/26/2021 14:09:15 - INFO - datasets.builder - Generating split train
2021-07-26T14:09:15.125310516Z  
     0%|          | 0/2 [00:00<?, ?it/s]
100%|██████████|    2/2 [00:00<00:00, 1147.55it/s]
2021-07-26T14:09:15.128544997Z  07/26/2021 14:09:15 - INFO - datasets.arrow_writer - Done writing 2000 examples in 164067 bytes /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete/text-train.arrow.
2021-07-26T14:09:15.128626548Z  07/26/2021 14:09:15 - INFO - datasets.builder - Generating split validation
2021-07-26T14:09:15.12993743Z   07/26/2021 14:09:15 - INFO - datasets.arrow_writer - Done writing 1000 examples in 90150 bytes /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete/text-validation.arrow.
2021-07-26T14:09:15.130003546Z  07/26/2021 14:09:15 - INFO - datasets.utils.info_utils - Unable to verify splits sizes.
2021-07-26T14:09:15.130088692Z  07/26/2021 14:09:15 - INFO - datasets.utils.filelock - Lock 139866173989600 acquired on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-dfca9c6f12495150_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete.lock
2021-07-26T14:09:15.130360478Z  07/26/2021 14:09:15 - INFO - datasets.utils.filelock - Lock 139866173989600 released on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-dfca9c6f12495150_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete.lock
2021-07-26T14:09:15.130449829Z  Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5. Subsequent calls will reuse this data.
2021-07-26T14:09:15.130456275Z  07/26/2021 14:09:15 - INFO - datasets.utils.filelock - Lock 139866173991472 released on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-dfca9c6f12495150_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
2021-07-26T14:09:15.130475953Z  07/26/2021 14:09:15 - INFO - datasets.builder - Constructing Dataset for split train, validation, from /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5
2021-07-26T14:09:15.314137303Z  
0   tables [00:00, ? tables/s]

0   tables [00:00, ? tables/s]

     0%|          | 0/2 [00:00<?, ?it/s]
100%|██████████|    2/2 [00:00<00:00, 655.77it/s]
2021-07-26T14:09:15.31416541Z   [INFO|file_utils.py:1624] 2021-07-26 14:09:15,313 >> https://huggingface.co/openai-gpt/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpd5znm5l1
2021-07-26T14:09:15.496180381Z  
Downloading:      0%|          | 0.00/656 [00:00<?, ?B/s]
Downloading:    100%|██████████| 656/656 [00:00<00:00, 433kB/s]
2021-07-26T14:09:15.496209117Z  [INFO|file_utils.py:1628] 2021-07-26 14:09:15,496 >> storing https://huggingface.co/openai-gpt/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/bebb46f5735701bc248ef9faa26f12577944fa7fc8e9be1a774b94d4cb8b79b6.ba6f10a5446f364b92311c09e55e49aa27024a4aeefc1ea50fd733b77bcd997d
2021-07-26T14:09:15.496286347Z  [INFO|file_utils.py:1636] 2021-07-26 14:09:15,496 >> creating metadata file for /root/.cache/huggingface/transformers/bebb46f5735701bc248ef9faa26f12577944fa7fc8e9be1a774b94d4cb8b79b6.ba6f10a5446f364b92311c09e55e49aa27024a4aeefc1ea50fd733b77bcd997d
2021-07-26T14:09:15.496582551Z  [INFO|configuration_utils.py:545] 2021-07-26 14:09:15,496 >> loading configuration file https://huggingface.co/openai-gpt/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/bebb46f5735701bc248ef9faa26f12577944fa7fc8e9be1a774b94d4cb8b79b6.ba6f10a5446f364b92311c09e55e49aa27024a4aeefc1ea50fd733b77bcd997d
2021-07-26T14:09:15.497318074Z  [INFO|configuration_utils.py:581] 2021-07-26 14:09:15,497 >> Model config OpenAIGPTConfig {
2021-07-26T14:09:15.497326601Z    "afn": "gelu",
2021-07-26T14:09:15.497332651Z    "architectures": [
2021-07-26T14:09:15.497338467Z      "OpenAIGPTLMHeadModel"
2021-07-26T14:09:15.49734389Z     ],
2021-07-26T14:09:15.497349194Z    "attn_pdrop": 0.1,
2021-07-26T14:09:15.497354591Z    "embd_pdrop": 0.1,
2021-07-26T14:09:15.497360424Z    "initializer_range": 0.02,
2021-07-26T14:09:15.497366131Z    "layer_norm_epsilon": 1e-05,
2021-07-26T14:09:15.4973717Z      "model_type": "openai-gpt",
2021-07-26T14:09:15.49737771Z     "n_ctx": 512,
2021-07-26T14:09:15.49738331Z     "n_embd": 768,
2021-07-26T14:09:15.497388484Z    "n_head": 12,
2021-07-26T14:09:15.497393747Z    "n_layer": 12,
2021-07-26T14:09:15.497399167Z    "n_positions": 512,
2021-07-26T14:09:15.497404934Z    "n_special": 0,
2021-07-26T14:09:15.497410553Z    "predict_special_tokens": true,
2021-07-26T14:09:15.497416327Z    "resid_pdrop": 0.1,
2021-07-26T14:09:15.497434673Z    "summary_activation": null,
2021-07-26T14:09:15.497440436Z    "summary_first_dropout": 0.1,
2021-07-26T14:09:15.497446023Z    "summary_proj_to_labels": true,
2021-07-26T14:09:15.497451297Z    "summary_type": "cls_index",
2021-07-26T14:09:15.497456789Z    "summary_use_proj": true,
2021-07-26T14:09:15.49746268Z     "task_specific_params": {
2021-07-26T14:09:15.497468433Z      "text-generation": {
2021-07-26T14:09:15.497474113Z        "do_sample": true,
2021-07-26T14:09:15.497479797Z        "max_length": 50
2021-07-26T14:09:15.497485073Z      }
2021-07-26T14:09:15.49749015Z     },
2021-07-26T14:09:15.497495326Z    "transformers_version": "4.9.0",
2021-07-26T14:09:15.497500982Z    "vocab_size": 40478
2021-07-26T14:09:15.497506886Z  }
2021-07-26T14:09:15.497512492Z  
2021-07-26T14:09:15.675411198Z  [INFO|tokenization_auto.py:432] 2021-07-26 14:09:15,674 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
2021-07-26T14:09:15.851918363Z  [INFO|configuration_utils.py:545] 2021-07-26 14:09:15,851 >> loading configuration file https://huggingface.co/openai-gpt/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/bebb46f5735701bc248ef9faa26f12577944fa7fc8e9be1a774b94d4cb8b79b6.ba6f10a5446f364b92311c09e55e49aa27024a4aeefc1ea50fd733b77bcd997d
2021-07-26T14:09:15.852684702Z  [INFO|configuration_utils.py:581] 2021-07-26 14:09:15,852 >> Model config OpenAIGPTConfig {
2021-07-26T14:09:15.852691992Z    "afn": "gelu",
2021-07-26T14:09:15.85269584Z     "architectures": [
2021-07-26T14:09:15.852699315Z      "OpenAIGPTLMHeadModel"
2021-07-26T14:09:15.852702686Z    ],
2021-07-26T14:09:15.852706345Z    "attn_pdrop": 0.1,
2021-07-26T14:09:15.852709633Z    "embd_pdrop": 0.1,
2021-07-26T14:09:15.852712825Z    "initializer_range": 0.02,
2021-07-26T14:09:15.852716035Z    "layer_norm_epsilon": 1e-05,
2021-07-26T14:09:15.852719184Z    "model_type": "openai-gpt",
2021-07-26T14:09:15.852722288Z    "n_ctx": 512,
2021-07-26T14:09:15.852725375Z    "n_embd": 768,
2021-07-26T14:09:15.852728435Z    "n_head": 12,
2021-07-26T14:09:15.852731725Z    "n_layer": 12,
2021-07-26T14:09:15.852734975Z    "n_positions": 512,
2021-07-26T14:09:15.852738185Z    "n_special": 0,
2021-07-26T14:09:15.852741425Z    "predict_special_tokens": true,
2021-07-26T14:09:15.852744547Z    "resid_pdrop": 0.1,
2021-07-26T14:09:15.85274759Z     "summary_activation": null,
2021-07-26T14:09:15.852750587Z    "summary_first_dropout": 0.1,
2021-07-26T14:09:15.852753673Z    "summary_proj_to_labels": true,
2021-07-26T14:09:15.852769472Z    "summary_type": "cls_index",
2021-07-26T14:09:15.852772952Z    "summary_use_proj": true,
2021-07-26T14:09:15.852776136Z    "task_specific_params": {
2021-07-26T14:09:15.852779304Z      "text-generation": {
2021-07-26T14:09:15.852782414Z        "do_sample": true,
2021-07-26T14:09:15.852785664Z        "max_length": 50
2021-07-26T14:09:15.852788824Z      }
2021-07-26T14:09:15.852791737Z    },
2021-07-26T14:09:15.852795052Z    "transformers_version": "4.9.0",
2021-07-26T14:09:15.852798497Z    "vocab_size": 40478
2021-07-26T14:09:15.85280183Z   }
2021-07-26T14:09:15.852805286Z  
2021-07-26T14:09:16.215260602Z  [INFO|file_utils.py:1624] 2021-07-26 14:09:16,215 >> https://huggingface.co/openai-gpt/resolve/main/vocab.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp5ct5tg0n
2021-07-26T14:09:16.457642584Z  
Downloading:      0%|          | 0.00/816k [00:00<?, ?B/s]
Downloading:    100%|██████████| 816k/816k [00:00<00:00, 14.9MB/s]
2021-07-26T14:09:16.457666203Z  [INFO|file_utils.py:1628] 2021-07-26 14:09:16,457 >> storing https://huggingface.co/openai-gpt/resolve/main/vocab.json in cache at /root/.cache/huggingface/transformers/918c57540c636a2a662770d208fcf20aa8c3faea78201fc612e5c84f052f1119.ac55819e76b0f8b0c32cbb407436947d090d98f8952f38376ee249ed382927ab
2021-07-26T14:09:16.457749557Z  [INFO|file_utils.py:1636] 2021-07-26 14:09:16,457 >> creating metadata file for /root/.cache/huggingface/transformers/918c57540c636a2a662770d208fcf20aa8c3faea78201fc612e5c84f052f1119.ac55819e76b0f8b0c32cbb407436947d090d98f8952f38376ee249ed382927ab
2021-07-26T14:09:16.642597998Z  [INFO|file_utils.py:1624] 2021-07-26 14:09:16,642 >> https://huggingface.co/openai-gpt/resolve/main/merges.txt not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp2_1m78tv
2021-07-26T14:09:16.874544236Z  
Downloading:      0%|          | 0.00/458k [00:00<?, ?B/s]
Downloading:    100%|██████████| 458k/458k [00:00<00:00, 10.9MB/s]
2021-07-26T14:09:16.874569317Z  [INFO|file_utils.py:1628] 2021-07-26 14:09:16,874 >> storing https://huggingface.co/openai-gpt/resolve/main/merges.txt in cache at /root/.cache/huggingface/transformers/a682e219a788dde0e4f77bc5a470d85a4d7e493420506ce7e3266f7be122cf9e.2150b9689fda7ca7c6224ff32672c004259f974e96934e8eb69d8dd546d682db
2021-07-26T14:09:16.87473933Z   [INFO|file_utils.py:1636] 2021-07-26 14:09:16,874 >> creating metadata file for /root/.cache/huggingface/transformers/a682e219a788dde0e4f77bc5a470d85a4d7e493420506ce7e3266f7be122cf9e.2150b9689fda7ca7c6224ff32672c004259f974e96934e8eb69d8dd546d682db
2021-07-26T14:09:17.0542553Z    [INFO|file_utils.py:1624] 2021-07-26 14:09:17,054 >> https://huggingface.co/openai-gpt/resolve/main/tokenizer.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpiqlissjs
2021-07-26T14:09:17.308757452Z  
Downloading:      0%|          | 0.00/1.27M [00:00<?, ?B/s]
Downloading:    100%|██████████| 1.27M/1.27M [00:00<00:00, 19.6MB/s]
2021-07-26T14:09:17.308790611Z  [INFO|file_utils.py:1628] 2021-07-26 14:09:17,308 >> storing https://huggingface.co/openai-gpt/resolve/main/tokenizer.json in cache at /root/.cache/huggingface/transformers/325373fcbb0daa99905371727842a87ae9ca0f02f71db071720bb4d5a59076cf.b1810f3c6ed9fc0632664008484a9b569103559c04ac90321723cd808a3a96f9
2021-07-26T14:09:17.308827786Z  [INFO|file_utils.py:1636] 2021-07-26 14:09:17,308 >> creating metadata file for /root/.cache/huggingface/transformers/325373fcbb0daa99905371727842a87ae9ca0f02f71db071720bb4d5a59076cf.b1810f3c6ed9fc0632664008484a9b569103559c04ac90321723cd808a3a96f9
2021-07-26T14:09:17.838142571Z  [INFO|tokenization_utils_base.py:1730] 2021-07-26 14:09:17,837 >> loading file https://huggingface.co/openai-gpt/resolve/main/vocab.json from cache at /root/.cache/huggingface/transformers/918c57540c636a2a662770d208fcf20aa8c3faea78201fc612e5c84f052f1119.ac55819e76b0f8b0c32cbb407436947d090d98f8952f38376ee249ed382927ab
2021-07-26T14:09:17.838167038Z  [INFO|tokenization_utils_base.py:1730] 2021-07-26 14:09:17,837 >> loading file https://huggingface.co/openai-gpt/resolve/main/merges.txt from cache at /root/.cache/huggingface/transformers/a682e219a788dde0e4f77bc5a470d85a4d7e493420506ce7e3266f7be122cf9e.2150b9689fda7ca7c6224ff32672c004259f974e96934e8eb69d8dd546d682db
2021-07-26T14:09:17.838171311Z  [INFO|tokenization_utils_base.py:1730] 2021-07-26 14:09:17,837 >> loading file https://huggingface.co/openai-gpt/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/325373fcbb0daa99905371727842a87ae9ca0f02f71db071720bb4d5a59076cf.b1810f3c6ed9fc0632664008484a9b569103559c04ac90321723cd808a3a96f9
2021-07-26T14:09:17.838174874Z  [INFO|tokenization_utils_base.py:1730] 2021-07-26 14:09:17,838 >> loading file https://huggingface.co/openai-gpt/resolve/main/added_tokens.json from cache at None
2021-07-26T14:09:17.838177733Z  [INFO|tokenization_utils_base.py:1730] 2021-07-26 14:09:17,838 >> loading file https://huggingface.co/openai-gpt/resolve/main/special_tokens_map.json from cache at None
2021-07-26T14:09:17.83818803Z   [INFO|tokenization_utils_base.py:1730] 2021-07-26 14:09:17,838 >> loading file https://huggingface.co/openai-gpt/resolve/main/tokenizer_config.json from cache at None
2021-07-26T14:09:18.023973304Z  [INFO|configuration_utils.py:545] 2021-07-26 14:09:18,023 >> loading configuration file https://huggingface.co/openai-gpt/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/bebb46f5735701bc248ef9faa26f12577944fa7fc8e9be1a774b94d4cb8b79b6.ba6f10a5446f364b92311c09e55e49aa27024a4aeefc1ea50fd733b77bcd997d
2021-07-26T14:09:18.025605412Z  [INFO|configuration_utils.py:581] 2021-07-26 14:09:18,025 >> Model config OpenAIGPTConfig {
2021-07-26T14:09:18.025632076Z    "afn": "gelu",
2021-07-26T14:09:18.025638821Z    "architectures": [
2021-07-26T14:09:18.025644803Z      "OpenAIGPTLMHeadModel"
2021-07-26T14:09:18.02565048Z     ],
2021-07-26T14:09:18.025655907Z    "attn_pdrop": 0.1,
2021-07-26T14:09:18.025659711Z    "embd_pdrop": 0.1,
2021-07-26T14:09:18.025663648Z    "initializer_range": 0.02,
2021-07-26T14:09:18.02566734Z     "layer_norm_epsilon": 1e-05,
2021-07-26T14:09:18.025671169Z    "model_type": "openai-gpt",
2021-07-26T14:09:18.025686901Z    "n_ctx": 512,
2021-07-26T14:09:18.025690748Z    "n_embd": 768,
2021-07-26T14:09:18.025694256Z    "n_head": 12,
2021-07-26T14:09:18.025697812Z    "n_layer": 12,
2021-07-26T14:09:18.025701325Z    "n_positions": 512,
2021-07-26T14:09:18.025705268Z    "n_special": 0,
2021-07-26T14:09:18.025709002Z    "predict_special_tokens": true,
2021-07-26T14:09:18.025712833Z    "resid_pdrop": 0.1,
2021-07-26T14:09:18.025716428Z    "summary_activation": null,
2021-07-26T14:09:18.025721606Z    "summary_first_dropout": 0.1,
2021-07-26T14:09:18.025727781Z    "summary_proj_to_labels": true,
2021-07-26T14:09:18.025732321Z    "summary_type": "cls_index",
2021-07-26T14:09:18.025735991Z    "summary_use_proj": true,
2021-07-26T14:09:18.025739869Z    "task_specific_params": {
2021-07-26T14:09:18.025743781Z      "text-generation": {
2021-07-26T14:09:18.025747651Z        "do_sample": true,
2021-07-26T14:09:18.025751454Z        "max_length": 50
2021-07-26T14:09:18.025755031Z      }
2021-07-26T14:09:18.025758401Z    },
2021-07-26T14:09:18.025761928Z    "transformers_version": "4.9.0",
2021-07-26T14:09:18.025765657Z    "vocab_size": 40478
2021-07-26T14:09:18.025769586Z  }
2021-07-26T14:09:18.02577327Z   
2021-07-26T14:09:23.021111594Z  07/26/2021 14:09:23 - INFO - __main__ - Training new model from scratch - Total size=111.14M params
2021-07-26T14:09:23.070773083Z  07/26/2021 14:09:23 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/cache-8e82676f86a14c2c.arrow
2021-07-26T14:09:23.094906386Z  07/26/2021 14:09:23 - INFO - datasets.arrow_writer - Done writing 2000 examples in 207498 bytes /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/tmpbehl1qz0.
2021-07-26T14:09:23.117860452Z  
Running tokenizer on dataset:   0%|          | 0/2 [00:00<?, ?ba/s]
Running tokenizer on dataset: 100%|██████████| 2/2 [00:00<00:00, 43.33ba/s]
2021-07-26T14:09:23.133773375Z  07/26/2021 14:09:23 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/cache-35b2963f79b3b422.arrow
2021-07-26T14:09:23.139336489Z  07/26/2021 14:09:23 - INFO - datasets.arrow_writer - Done writing 1000 examples in 113806 bytes /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/tmp9n9hycnj.
2021-07-26T14:09:23.144312664Z  
Running tokenizer on dataset:   0%|          | 0/1 [00:00<?, ?ba/s]
Running tokenizer on dataset: 100%|██████████| 1/1 [00:00<00:00, 46.94ba/s]
2021-07-26T14:09:23.235184764Z  07/26/2021 14:09:23 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/cache-f0614aafe173fe5c.arrow
2021-07-26T14:09:23.340753289Z  07/26/2021 14:09:23 - INFO - datasets.arrow_writer - Done writing 72 examples in 480120 bytes /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/tmpbjayy6wf.
2021-07-26T14:09:23.344673188Z  
Grouping    texts in chunks of 512:   0%|          | 0/2 [00:00<?, ?ba/s]
Grouping    texts in chunks of 512: 100%|██████████| 2/2 [00:00<00:00, 10.21ba/s]
Grouping    texts in chunks of 512: 100%|██████████| 2/2 [00:00<00:00, 10.20ba/s]
2021-07-26T14:09:23.449866442Z  07/26/2021 14:09:23 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/cache-9636fc49daf5222e.arrow
2021-07-26T14:09:23.454281769Z  07/26/2021 14:09:23 - INFO - datasets.arrow_writer - Done writing 39 examples in 260064 bytes /root/.cache/huggingface/datasets/text/default-dfca9c6f12495150/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/tmpz8sa4yn6.
2021-07-26T14:09:23.482471097Z  07/26/2021 14:09:23 - INFO - datasets.arrow_writer - Done writing 200000 indices in 320000000 bytes .
2021-07-26T14:09:23.485361448Z  07/26/2021 14:09:23 - INFO - datasets.arrow_writer - Done writing 7000 indices in 392000 bytes .
2021-07-26T14:09:25.751105446Z  
Grouping    texts in chunks of 512:   0%|          | 0/1 [00:00<?, ?ba/s]
Grouping    texts in chunks of 512: 100%|██████████| 1/1 [00:00<00:00,  9.15ba/s]
Grouping    texts in chunks of 512: 100%|██████████| 1/1 [00:00<00:00,  9.13ba/s]
2021-07-26T14:09:25.751141123Z  [INFO|trainer.py:404] 2021-07-26 14:09:25,750 >> max_steps is given, it will override any value given in num_train_epochs
2021-07-26T14:09:25.757944575Z  [INFO|trainer.py:1164] 2021-07-26 14:09:25,757 >> ***** Running training *****
2021-07-26T14:09:25.757972847Z  [INFO|trainer.py:1165] 2021-07-26 14:09:25,757 >>   Num examples = 200000
2021-07-26T14:09:25.757978165Z  [INFO|trainer.py:1166] 2021-07-26 14:09:25,757 >>   Num Epochs = 516
2021-07-26T14:09:25.757982299Z  [INFO|trainer.py:1167] 2021-07-26 14:09:25,757 >>   Instantaneous batch size per device = 32
2021-07-26T14:09:25.757986728Z  [INFO|trainer.py:1168] 2021-07-26 14:09:25,757 >>   Total train batch size (w. parallel, distributed & accumulation) = 2048
2021-07-26T14:09:25.757990875Z  [INFO|trainer.py:1169] 2021-07-26 14:09:25,757 >>   Gradient Accumulation steps = 32
2021-07-26T14:09:25.757994803Z  [INFO|trainer.py:1170] 2021-07-26 14:09:25,757 >>   Total optimization steps = 50000
2021-07-26T14:09:27.841919702Z  
     0%|          | 0/50000 [00:00<?, ?it/s]
Traceback (most recent call last):
2021-07-26T14:09:27.841956297Z    File "run_clm.py", line 572, in <module>
2021-07-26T14:09:27.841963933Z      main()
2021-07-26T14:09:27.841969132Z    File "run_clm.py", line 522, in main
2021-07-26T14:09:27.841991003Z      train_result = trainer.train(resume_from_checkpoint=checkpoint)
2021-07-26T14:09:27.841996801Z    File "/home/user/miniconda/lib/python3.8/site-packages/transformers/trainer.py", line 1280, in train
2021-07-26T14:09:27.842002482Z      tr_loss += self.training_step(model, inputs)
2021-07-26T14:09:27.842007478Z    File "/home/user/miniconda/lib/python3.8/site-packages/transformers/trainer.py", line 1773, in training_step
2021-07-26T14:09:27.842012807Z      loss = self.compute_loss(model, inputs)
2021-07-26T14:09:27.842017737Z    File "/home/user/miniconda/lib/python3.8/site-packages/transformers/trainer.py", line 1805, in compute_loss
2021-07-26T14:09:27.84202311Z       outputs = model(**inputs)
2021-07-26T14:09:27.842028183Z    File "/home/user/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
2021-07-26T14:09:27.842034154Z      result = self.forward(*input, **kwargs)
2021-07-26T14:09:27.842039413Z    File "/home/user/miniconda/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
2021-07-26T14:09:27.842045122Z      outputs = self.parallel_apply(replicas, inputs, kwargs)
2021-07-26T14:09:27.84205038Z     File "/home/user/miniconda/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
2021-07-26T14:09:27.842055852Z      return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
2021-07-26T14:09:27.842061165Z    File "/home/user/miniconda/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
2021-07-26T14:09:27.842066725Z      output.reraise()
2021-07-26T14:09:27.842071565Z    File "/home/user/miniconda/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
2021-07-26T14:09:27.842077398Z      raise self.exc_type(msg)
2021-07-26T14:09:27.842082546Z  StopIteration: Caught StopIteration in replica 0 on device 0.
2021-07-26T14:09:27.842087891Z  Original Traceback (most recent call last):
2021-07-26T14:09:27.842093056Z    File "/home/user/miniconda/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
2021-07-26T14:09:27.842098477Z      output = module(*input, **kwargs)
2021-07-26T14:09:27.84210327Z     File "/home/user/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
2021-07-26T14:09:27.842108627Z      result = self.forward(*input, **kwargs)
2021-07-26T14:09:27.842113465Z    File "/home/user/miniconda/lib/python3.8/site-packages/transformers/models/openai/modeling_openai.py", line 581, in forward
2021-07-26T14:09:27.842119416Z      transformer_outputs = self.transformer(
2021-07-26T14:09:27.8421263Z      File "/home/user/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
2021-07-26T14:09:27.842132244Z      result = self.forward(*input, **kwargs)
2021-07-26T14:09:27.842137575Z    File "/home/user/miniconda/lib/python3.8/site-packages/transformers/models/openai/modeling_openai.py", line 487, in forward
2021-07-26T14:09:27.842147909Z      attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
2021-07-26T14:09:27.842153517Z  StopIteration
2021-07-26T14:09:27.842158291Z  
2021-07-26T14:09:28.598937Z 
     0%|          | 0/50000 [00:02<?, ?it/s]
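
A note on reading the traceback: the innermost frame is `attention_mask.to(dtype=next(self.parameters()).dtype)`, and the exception is a bare `StopIteration`. That is what Python's built-in `next()` raises when the iterator it receives is already empty, so the log suggests that `self.parameters()` yielded nothing inside the DataParallel replica. Why it was empty is not established in this thread. The sketch below reproduces only the Python-level behaviour with a stand-in iterator; `empty_parameters` and `first_or_default` are hypothetical names for illustration, not part of transformers or PyTorch, and this is not presented as the library's fix.

```python
from typing import Any, Iterable, Iterator


def empty_parameters() -> Iterator[Any]:
    """Hypothetical stand-in for a module whose parameters() yields nothing."""
    return iter(())


def first_or_default(items: Iterable[Any], default: Any = None) -> Any:
    """Return the first element of `items`, or `default` if it is empty."""
    return next(iter(items), default)


# One-argument next() on an empty iterator raises StopIteration,
# matching the last line of the traceback.
try:
    next(empty_parameters())
    raised = False
except StopIteration:
    raised = True

# Two-argument next() returns the fallback instead of raising.
# "float32" is just an illustrative placeholder value.
fallback = first_or_default(empty_parameters(), default="float32")

print(raised, fallback)
```

The two-argument form of `next()` is a general way to avoid the exception when an empty iterator is a legitimate possibility; whether a fallback dtype is appropriate here depends on the model code.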

Expected behavior

The same as run_clm.py with a single GPU.

sgugger commented 3 years ago

I am unable to reproduce the problem. (Also, you seem to have modified the run_clm script, since the original does not accept these arguments: --method range --source fi.json --from_scratch.) In general, PyTorch discourages the use of DataParallel for multi-GPU training, so you could check whether DistributedDataParallel (launching the script with torch.distributed.launch) works better.
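
As a rough illustration of that suggestion, the sketch below assembles a `torch.distributed.launch` command for the reported run. It assumes two processes (the log reports `n_gpu: 2`) and drops the three custom flags that the stock script does not accept; `build_ddp_command` is a hypothetical helper written for this example, and the command is only printed, not executed.

```python
import shlex
import sys
from typing import List


def build_ddp_command(script: str, script_args: List[str], num_gpus: int) -> List[str]:
    """Build an argv list that runs `script` under torch.distributed.launch."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        sys.executable, "-m", "torch.distributed.launch",
        "--nproc_per_node", str(num_gpus),  # one process per GPU
        script, *script_args,
    ]


# Arguments taken from the report, without --method, --source and --from_scratch.
args = shlex.split(
    "--model_name_or_path openai-gpt "
    "--train_file dataset/train.txt --validation_file dataset/eval.txt "
    "--do_train --do_eval --output_dir /tmp/ "
    "--per_device_eval_batch_size 4 --per_device_train_batch_size 4"
)
command = build_ddp_command("run_clm.py", args, num_gpus=2)

# Print the shell-quoted command; to launch it, pass `command` to
# subprocess.run(command, check=True) on a machine with the GPUs available.
print(" ".join(shlex.quote(part) for part in command))
```

With this launcher each GPU gets its own process, so the model is wrapped in DistributedDataParallel rather than DataParallel; whether that avoids the error above was not confirmed in the thread.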

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.