huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Distributed TPU training with run_mlm duplicate data #12883

Closed alierenak closed 3 years ago

alierenak commented 3 years ago

Environment info

Who can help

@sgugger @patil-suraj

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

I have made small modifications to examples/pytorch/language-modeling/run_mlm_no_trainer.py; the changes are as follows (the modified code is available at https://github.com/akalieren/transformers-master):

  1. Defined an mp_fn entry point in the training script (a minimal sketch follows this list).
  2. Added streaming_data=True to the dataset class.
  3. Removed the tpu_num_cores argument from xla_spawn.py's sys.argv since it throws an error.
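
For context, xla_spawn.py imports the training script as a module and calls an _mp_fn entry point in every spawned process, so the addition is roughly the sketch below (my reconstruction, not the exact code in the linked fork; main is the script's existing entry point):

    # Rough sketch of the _mp_fn entry point added to run_mlm_no_trainer.py.
    # xla_spawn.py calls _mp_fn(index) once per TPU process, so every process
    # re-runs main() from the top, including dataset download and tokenization.
    def _mp_fn(index):
        main()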

The task I am working on is:

To reproduce

Steps to reproduce the behavior:

  1. Clone the modified repo: git clone https://github.com/akalieren/transformers-master
  2. export XRT_TPU_CONFIG="localservice;0;localhost:51011"
  3. Install the required libraries (I did not add the extra packages to requirements.txt, to highlight that they are not listed in the official example):
    cd transformers-master
    pip install .
    pip install -r examples/pytorch/language-modeling/requirements.txt
    pip install accelerate
    pip install datasets[streaming]
  4. Run command
    python3 examples/pytorch/xla_spawn.py --num_cores 8 examples/pytorch/language-modeling/run_mlm_no_trainer.py   --model_type "roberta" --per_device_eval_batch_size 512 --per_device_train_batch_size 512 --max_train_steps 1000000 --preprocessing_num_workers 50 --pad_to_max_length  --tokenizer_name "./tokenizers/Roberta/"  --dataset_name='oscar'  --dataset_config_name='unshuffled_deduplicated_fr' --data_streaming=True --max_seq_length 512   --line_by_line=True

Note: Without xla_spawn, Accelerate uses only one core; that is why I changed this. With 1 core it runs, but slowly.
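
For reference, xla_spawn.py essentially loads the training script and forks one process per core, each of which runs the whole script body; this is relevant to the duplicated preprocessing shown below. A simplified sketch from memory (not the exact file, argument parsing omitted):

    # Simplified sketch of examples/pytorch/xla_spawn.py: load the training
    # script as a module, then spawn one process per TPU core, each calling
    # the module's _mp_fn.
    import importlib.util
    import sys

    import torch_xla.distributed.xla_multiprocessing as xmp

    def spawn_on_tpu(script_path, num_cores=8):
        spec = importlib.util.spec_from_file_location("training_script", script_path)
        mod = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(mod)
        xmp.spawn(mod._mp_fn, args=(), nprocs=num_cores)

    if __name__ == "__main__":
        spawn_on_tpu(sys.argv[1])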

2021-07-26 00:30:54.355600: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2021-07-26 00:30:54.355659: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
07/26/2021 00:31:13 - INFO - run_mlm_no_trainer - Distributed environment: TPU
Num processes: 8
Process index: 0
Local process index: 0
Device: xla:1
Use FP16 precision: False

Downloading and preparing dataset oscar/unshuffled_deduplicated_tr (download: 9.68 GiB, generated: 26.43 GiB, post-processed: Unknown size, total: 36.10 GiB) to /home/akali/.cache/huggingface/datasets/oscar/unshuffled_deduplicated_tr/1.0.0/84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2...
07/26/2021 00:31:20 - INFO - run_mlm_no_trainer - Distributed environment: TPU
Num processes: 8
Process index: 1
Local process index: 1
Device: xla:0
Use FP16 precision: False

07/26/2021 00:31:20 - INFO - run_mlm_no_trainer - Distributed environment: TPU
Num processes: 8
Process index: 5
Local process index: 5
Device: xla:0
Use FP16 precision: False

07/26/2021 00:31:20 - INFO - run_mlm_no_trainer - Distributed environment: TPU
Num processes: 8
Process index: 7
Local process index: 7
Device: xla:0
Use FP16 precision: False

07/26/2021 00:31:20 - INFO - run_mlm_no_trainer - Distributed environment: TPU
Num processes: 8
Process index: 6
Local process index: 6
Device: xla:0
Use FP16 precision: False

07/26/2021 00:31:21 - INFO - run_mlm_no_trainer - Distributed environment: TPU
Num processes: 8
Process index: 2
Local process index: 2
Device: xla:0
Use FP16 precision: False

07/26/2021 00:31:21 - INFO - run_mlm_no_trainer - Distributed environment: TPU
Num processes: 8
Process index: 4
Local process index: 4
Device: xla:0
Use FP16 precision: False

07/26/2021 00:31:23 - INFO - run_mlm_no_trainer - Distributed environment: TPU
Num processes: 8
Process index: 3
Local process index: 3
Device: xla:0
Use FP16 precision: False

0 examples [00:00, ? examples/s]07/26/2021 00:31:44 - INFO - datasets_modules.datasets.oscar.84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2.oscar - generating examples from = /home/akali/.cache/huggingface/datasets/downloads/657d72dc352d822d0496bb9f519cf0de87b87064d56024d9d1ac5585568125b1
718146 examples [00:48, 14431.60 examples/s]07/26/2021 00:32:32 - INFO - datasets_modules.datasets.oscar.84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2.oscar - generating examples from = /home/akali/.cache/huggingface/datasets/downloads/f9b566f31181a53d426a2dc982a1b1de06cc92541de83cee688e5c57f4874300
1471415 examples [01:36, 13302.22 examples/s]07/26/2021 00:33:21 - INFO - datasets_modules.datasets.oscar.84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2.oscar - generating examples from = /home/akali/.cache/huggingface/datasets/downloads/21f0672cc841442e067c7ea57471788dbd350f889acbd8028e75edb9efcacddb
2229278 examples [02:24, 16466.88 examples/s]07/26/2021 00:34:09 - INFO - datasets_modules.datasets.oscar.84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2.oscar - generating examples from = /home/akali/.cache/huggingface/datasets/downloads/c027123c743fb1e0079bcd3be75f0ba6be89c6997f6b000e97c33f9c3d9c2742
2997743 examples [03:13, 18057.68 examples/s]07/26/2021 00:34:58 - INFO - datasets_modules.datasets.oscar.84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2.oscar - generating examples from = /home/akali/.cache/huggingface/datasets/downloads/d7cc7a7389a8187b043cf359794e6fdc7783d5d0b6e7d737381e89d34c25e441
3772944 examples [04:02, 15671.97 examples/s]07/26/2021 00:35:46 - INFO - datasets_modules.datasets.oscar.84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2.oscar - generating examples from = /home/akali/.cache/huggingface/datasets/downloads/a0175299b2eb4767f27e4f73c6848609be453fa5eb8d36dd6f8ecfd2c60a1e01
4569497 examples [04:51, 18017.92 examples/s]07/26/2021 00:36:35 - INFO - datasets_modules.datasets.oscar.84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2.oscar - generating examples from = /home/akali/.cache/huggingface/datasets/downloads/6b432b7a552ccc65da0810808506bb7570162447776507b2b47319a230b48aa3
5356241 examples [05:39, 16205.13 examples/s]07/26/2021 00:37:24 - INFO - datasets_modules.datasets.oscar.84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2.oscar - generating examples from = /home/akali/.cache/huggingface/datasets/downloads/ef34899af5cac3b75a798286fad2be831177c0833dab12c19c139b694d8c3544
6151458 examples [06:29, 11766.89 examples/s]07/26/2021 00:38:14 - INFO - datasets_modules.datasets.oscar.84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2.oscar - generating examples from = /home/akali/.cache/huggingface/datasets/downloads/9926c88e0b8a2013f57aaef129cb9978ff129b8bfb3408c1194852c806249f9d
6957212 examples [07:18, 18684.33 examples/s]07/26/2021 00:39:03 - INFO - datasets_modules.datasets.oscar.84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2.oscar - generating examples from = /home/akali/.cache/huggingface/datasets/downloads/aae79457ef2f44cd9ef24584b894c033d9099e6bc8e15b661a349cc185a230d7
7763558 examples [08:07, 16309.71 examples/s]07/26/2021 00:39:52 - INFO - datasets_modules.datasets.oscar.84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2.oscar - generating examples from = /home/akali/.cache/huggingface/datasets/downloads/0274c31e96e2728161263b15aa4da982825eec91c7b0693756a890e76d1167c4
8565051 examples [08:57, 17289.47 examples/s]07/26/2021 00:40:41 - INFO - datasets_modules.datasets.oscar.84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2.oscar - generating examples from = /home/akali/.cache/huggingface/datasets/downloads/f6423f5486261f771097352c7e2ae07643ad0f2fcf5f5d68c6a9921f8bd1e6a3
9397678 examples [09:46, 16643.61 examples/s]07/26/2021 00:41:30 - INFO - datasets_modules.datasets.oscar.84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2.oscar - generating examples from = /home/akali/.cache/huggingface/datasets/downloads/2edc5ca535c1ea46aaacebf7f68a3553aa5d92b70e574f05709fa02dc52b5f4e
10231465 examples [10:36, 12871.41 examples/s]07/26/2021 00:42:20 - INFO - datasets_modules.datasets.oscar.84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2.oscar - generating examples from = /home/akali/.cache/huggingface/datasets/downloads/3a06d248b02355ecdcf097df97a9e670db72c42456df9d04b15d4187933263ed
11075179 examples [11:26, 16567.73 examples/s]07/26/2021 00:43:11 - INFO - datasets_modules.datasets.oscar.84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2.oscar - generating examples from = /home/akali/.cache/huggingface/datasets/downloads/0e3af1310ea118f4a5e8c13b40a561ae20ba209ae196d633a68155af35ec049c
Dataset oscar downloaded and prepared to /home/akali/.cache/huggingface/datasets/oscar/unshuffled_deduplicated_tr/1.0.0/84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2. Subsequent calls will reuse this data.
07/26/2021 00:43:42 - WARNING - datasets.builder - Reusing dataset oscar (/home/akali/.cache/huggingface/datasets/oscar/unshuffled_deduplicated_tr/1.0.0/84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2)
07/26/2021 00:43:42 - WARNING - run_mlm_no_trainer - You are instantiating a new config instance from scratch.
loading configuration file ./tokenizers/Roberta/config.json
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.10.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}

Didn't find file ./tokenizers/Roberta/tokenizer.json. We won't load it.
Didn't find file ./tokenizers/Roberta/added_tokens.json. We won't load it.
loading file ./tokenizers/Roberta/vocab.json
loading file ./tokenizers/Roberta/merges.txt
loading file None
loading file None
loading file ./tokenizers/Roberta/special_tokens_map.json
loading file ./tokenizers/Roberta/tokenizer_config.json
loading configuration file ./tokenizers/Roberta/config.json
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.10.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}

loading configuration file ./tokenizers/Roberta/config.json
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.10.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}

# AFTER THIS POINT:
The script started to print the tokenizer tqdm progress bars multiple times, like this:
----> LOOK HERE  Running tokenizer on dataset line_by_line #43:  19%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–                                                                                               | 43/221 [12:20<51:05, 17.22s/ba]
Running tokenizer on dataset line_by_line #36:  19%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–                                                                                               | 43/221 [12:24<51:20, 17.30s/ba]
Running tokenizer on dataset line_by_line #29:  19%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–                                                                                               | 43/221 [12:28<51:37, 17.40s/ba]
Running tokenizer on dataset line_by_line #38:  19%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–                                                                                               | 43/221 [12:22<51:15, 17.28s/ba]
Running tokenizer on dataset line_by_line #5:  18%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–                                                                                                  | 39/221 [12:33<58:34, 19.31s/ba]
Running tokenizer on dataset line_by_line #21:  19%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–                                                                                               | 43/221 [12:30<51:45, 17.45s/ba]
Running tokenizer on dataset line_by_line #46:  19%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–                                                                                               | 43/221 [12:19<51:01, 17.20s/ba]
Running tokenizer on dataset line_by_line #38:  19%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–                                                                                               | 43/221 [12:25<51:25, 17.34s/ba]
Running tokenizer on dataset line_by_line #42:  19%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–                                                                                               | 43/221 [12:23<51:19, 17.30s/ba]
Running tokenizer on dataset line_by_line #35:  19%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–                                                                                               | 43/221 [12:26<51:31, 17.37s/ba]
Running tokenizer on dataset line_by_line #21:  19%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–                                                                                               | 43/221 [12:30<51:48, 17.46s/ba]
Running tokenizer on dataset line_by_line #45:  19%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–                                                                                               | 43/221 [12:23<51:17, 17.29s/ba]
Running tokenizer on dataset line_by_line #35:  19%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–                                                                                               | 43/221 [12:27<51:34, 17.38s/ba]
----> AND HERE Running tokenizer on dataset line_by_line #43:  18%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   

As can be seen, worker #43 is printed twice, but the percentages are inconsistent. Since a progress bar's percentage cannot decrease, I think the preprocessing is being run separately on each core.
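
A hedged sketch of the kind of guard I would expect to avoid this, assuming a recent Accelerate version that provides the main_process_first() context manager and the names already defined in run_mlm_no_trainer.py (accelerator, raw_datasets, tokenize_function, text_column_name, args):

    # Sketch only: run the tokenization once on the main process; the other
    # processes wait and then reuse the cached Arrow files instead of mapping
    # the full dataset again on every core.
    with accelerator.main_process_first():
        tokenized_datasets = raw_datasets.map(
            tokenize_function,
            batched=True,
            num_proc=args.preprocessing_num_workers,
            remove_columns=[text_column_name],
            load_from_cache_file=not args.overwrite_cache,
            desc="Running tokenizer on dataset line_by_line",
        )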

Expected behavior

I expected the training script to run on 8 cores at normal speed, but it stops at this point and does not continue, even without my small changes.

sgugger commented 3 years ago

Dataset streaming has not been tested on any of the examples, so I'm not sure it works, especially for distributed training on TPUs.

alierenak commented 3 years ago

I have been working on this feature for several days. In particular, I am trying to implement an IterableDataset that reads preprocessed data from cloud storage. Do you think the problem is with streaming or with the IterableDataset? Using a PyTorch IterableDataset in distributed training can be tricky, as can be seen from this issue; a rough sketch of the sharding concern is below.
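
As an illustration of the trickiness (hypothetical class, not code from my branch): with an IterableDataset, each distributed process has to skip to its own shard of the stream by hand, otherwise all 8 TPU cores iterate over, and preprocess, the same examples.

    # Hypothetical sketch: round-robin sharding of an iterable stream so that
    # process `rank` only yields every world_size-th example.
    from torch.utils.data import IterableDataset

    class ShardedStream(IterableDataset):
        def __init__(self, example_iterable, rank, world_size):
            self.example_iterable = example_iterable
            self.rank = rank
            self.world_size = world_size

        def __iter__(self):
            for i, example in enumerate(self.example_iterable):
                if i % self.world_size == self.rank:
                    yield example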

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.