huggingface / transformers

šŸ¤— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

IndexError: index out of bound, MLM+XLA (pre-training) #12438

Closed · neel04 closed this issue 3 years ago

neel04 commented 3 years ago

Environment info

Who can help

Not sure who might be the most appropriate person

Information

Model I am using (Bert, XLNet ...): BigBird (MLM)

The problem arises when using:

The tasks I am working on are:

To reproduce

This error occurs with the MLM script (PyTorch) when attempting to pre-train BigBird on TPUs via XLA. The dataset in question is a custom one, and the model config and tokenizer have been initialized appropriately.

This is a continuation of this unanswered forum post, which hits the same error.

Command used to run the script:

%%bash
python xla_spawn.py --num_cores=8 ./run_mlm.py --output_dir="./results" \
    --model_type="big_bird" \
    --config_name="./config" \
    --tokenizer_name="./tokenizer" \
    --train_file="./dataset.txt" \
    --validation_file="./val.txt" \
    --line_by_line="True" \
    --max_seq_length="16000" \
    --weight_decay="0.01" \
    --per_device_train_batch_size="1" \
    --per_device_eval_batch_size="1" \
    --learning_rate="3e-4" \
    --tpu_num_cores='8' \
    --warmup_steps="1000" \
    --overwrite_output_dir \
    --pad_to_max_length \
    --num_train_epochs="5" \
    --adam_beta1="0.9" \
    --adam_beta2="0.98" \
    --do_train \
    --do_eval \
    --logging_steps="50" \
    --evaluation_strategy="steps" \
    --eval_accumulation_steps='10' \
    --report_to="tensorboard" \
    --logging_dir='./logs' \
    --save_strategy="epoch" \
    --load_best_model_at_end='True' \
    --metric_for_best_model='validation' \
    --preprocessing_num_workers='15'

I am facing two errors, to be precise:

Exception in device=TPU:0: Default process group has not been initialized, please make sure to call init_process_group.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/transformers/training_args.py", line 1006, in main_process_first
    yield
  File "/content/run_mlm.py", line 393, in main
    desc="Running tokenizer on dataset line_by_line",
  File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in map
    for k, dataset in self.items()
  File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in <dictcomp>
    for k, dataset in self.items()
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1664, in map
    for rank in range(num_proc)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1664, in <listcomp>
    for rank in range(num_proc)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2664, in shard
    writer_batch_size=writer_batch_size,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 186, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 397, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2254, in select
    return self._new_dataset_with_indices(indices_buffer=buf_writer.getvalue(), fingerprint=new_fingerprint)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2170, in _new_dataset_with_indices
    fingerprint=fingerprint,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 297, in __init__
    self._indices.column(0)[0].type
  File "pyarrow/table.pxi", line 162, in pyarrow.lib.ChunkedArray.__getitem__
  File "pyarrow/array.pxi", line 549, in pyarrow.lib._normalize_index
IndexError: index out of bounds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/run_mlm.py", line 529, in _mp_fn
    main()
  File "/content/run_mlm.py", line 393, in main
    desc="Running tokenizer on dataset line_by_line",
  File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/dist-packages/transformers/training_args.py", line 1011, in main_process_first
    torch.distributed.barrier()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 2523, in barrier
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

I haven't modified the script to call init_process_group yet; I'm focusing on the earlier index-out-of-bounds error. Clearly the problem arises from my own dataset, which was working before, however. Interestingly, it occurs during the tokenizing stage.

At some point while constructing the Arrow dataset, it fails. I have no idea about Apache Arrow, so I can't debug further :sweat_smile:

As for the dataset, a few simple lines of code with random numbers are more than enough to reproduce it.

import random

# Write 50 lines, each containing 16000 space-separated random numbers in [0, 40000].
with open('./dataset.txt', 'w') as f:
    for _ in range(50):
        f.write(' '.join(str(random.randint(0, 40000)) for _ in range(16000)) + '\n')

Can anyone give me some guidance on where I should start investigating the error, and some possible leads as to its origin? Any ideas on how I can solve it?

LysandreJik commented 3 years ago

Maybe @lhoestq has an idea for the error in datasets

neel04 commented 3 years ago

@lhoestq Any possible leads as to who can solve this bug?

neel04 commented 3 years ago

This is the full traceback, by the way, in case it helps move things along. I am also willing to create a reproducible Colab if you want:

06/28/2021 17:23:13 - WARNING - run_mlm - Process rank: -1, device: xla:0, n_gpu: 0distributed training: False, 16-bits training: False
06/28/2021 17:23:13 - WARNING - datasets.builder - Using custom data configuration default-e8bc7b301aa1b353
06/28/2021 17:23:13 - WARNING - datasets.builder - Reusing dataset text (/root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
Downloading and preparing dataset text/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5...
Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5. Subsequent calls will reuse this data.

WARNING:root:TPU has started up successfully with version pytorch-1.9
WARNING:root:TPU has started up successfully with version pytorch-1.9
WARNING:run_mlm:Process rank: -1, device: xla:1, n_gpu: 0distributed training: False, 16-bits training: False
INFO:run_mlm:Training/evaluation parameters TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.98,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=10,
eval_steps=50,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
greater_is_better=True,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0003,
length_column_name=length,
load_best_model_at_end=True,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=./logs,
logging_first_step=False,
logging_steps=50,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=validation,
mp_parameters=,
no_cuda=False,
num_train_epochs=5.0,
output_dir=./results,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=results,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=./results,
save_steps=500,
save_strategy=IntervalStrategy.EPOCH,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=8,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=1000,
weight_decay=0.01,
)
WARNING:datasets.builder:Using custom data configuration default-e8bc7b301aa1b353
INFO:datasets.utils.filelock:Lock 139795201622480 acquired on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-e8bc7b301aa1b353_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
INFO:datasets.utils.filelock:Lock 139795201622480 released on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-e8bc7b301aa1b353_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
INFO:datasets.utils.filelock:Lock 139795201622864 acquired on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-e8bc7b301aa1b353_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
INFO:datasets.builder:Generating dataset text (/root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 2/2 [00:00<00:00, 2330.17it/s]
INFO:datasets.utils.download_manager:Downloading took 0.0 min
INFO:datasets.utils.download_manager:Checksum Computation took 0.0 min
100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 2/2 [00:00<00:00, 920.91it/s]
INFO:datasets.utils.info_utils:Unable to verify checksums.
INFO:datasets.builder:Generating split train
INFO:datasets.arrow_writer:Done writing 8 examples in 172 bytes /root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete/text-train.arrow.
INFO:datasets.builder:Generating split validation
INFO:datasets.arrow_writer:Done writing 8 examples in 172 bytes /root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete/text-validation.arrow.
INFO:datasets.utils.info_utils:Unable to verify splits sizes.
INFO:datasets.utils.filelock:Lock 139795201625808 acquired on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-e8bc7b301aa1b353_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete.lock
INFO:datasets.utils.filelock:Lock 139795201625808 released on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-e8bc7b301aa1b353_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete.lock
INFO:datasets.utils.filelock:Lock 139795201622864 released on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-e8bc7b301aa1b353_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
INFO:datasets.builder:Constructing Dataset for split train, validation, from /root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5
100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 2/2 [00:00<00:00, 458.74it/s]
[INFO|configuration_utils.py:528] 2021-06-28 17:23:13,619 >> loading configuration file ./config/config.json
[INFO|configuration_utils.py:566] 2021-06-28 17:23:13,619 >> Model config BigBirdConfig {
  "architectures": [
    "BigBirdForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "attention_type": "block_sparse",
  "block_size": 64,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 16000,
  "model_type": "big_bird",
  "num_attention_heads": 4,
  "num_hidden_layers": 4,
  "num_random_blocks": 3,
  "pad_token_id": 0,
  "rescale_embeddings": false,
  "sep_token_id": 66,
  "transformers_version": "4.9.0.dev0",
  "type_vocab_size": 2,
  "use_bias": true,
  "use_cache": true,
  "vocab_size": 40000
}

[INFO|tokenization_utils_base.py:1651] 2021-06-28 17:23:13,620 >> Didn't find file ./tokenizer/spiece.model. We won't load it.
[INFO|tokenization_utils_base.py:1651] 2021-06-28 17:23:13,620 >> Didn't find file ./tokenizer/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1715] 2021-06-28 17:23:13,620 >> loading file None
[INFO|tokenization_utils_base.py:1715] 2021-06-28 17:23:13,620 >> loading file ./tokenizer/tokenizer.json
[INFO|tokenization_utils_base.py:1715] 2021-06-28 17:23:13,620 >> loading file None
[INFO|tokenization_utils_base.py:1715] 2021-06-28 17:23:13,620 >> loading file ./tokenizer/special_tokens_map.json
[INFO|tokenization_utils_base.py:1715] 2021-06-28 17:23:13,620 >> loading file ./tokenizer/tokenizer_config.json
INFO:run_mlm:Training new model from scratch
Exception in device=TPU:6: Default process group has not been initialized, please make sure to call init_process_group.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/run_mlm.py", line 529, in _mp_fn
    main()
  File "/content/run_mlm.py", line 386, in main
    with training_args.main_process_first(desc="dataset map tokenization"):
  File "/usr/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.7/dist-packages/transformers/training_args.py", line 1005, in main_process_first
    torch.distributed.barrier()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 2523, in barrier
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 0 indices in 0 bytes .
Exception in device=TPU:0: Default process group has not been initialized, please make sure to call init_process_group.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/transformers/training_args.py", line 1006, in main_process_first
    yield
  File "/content/run_mlm.py", line 393, in main
    desc="Running tokenizer on dataset line_by_line",
  File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in map
    for k, dataset in self.items()
  File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in <dictcomp>
    for k, dataset in self.items()
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1664, in map
    for rank in range(num_proc)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1664, in <listcomp>
    for rank in range(num_proc)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2664, in shard
    writer_batch_size=writer_batch_size,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 186, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 397, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2254, in select
    return self._new_dataset_with_indices(indices_buffer=buf_writer.getvalue(), fingerprint=new_fingerprint)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2170, in _new_dataset_with_indices
    fingerprint=fingerprint,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 297, in __init__
    self._indices.column(0)[0].type
  File "pyarrow/table.pxi", line 162, in pyarrow.lib.ChunkedArray.__getitem__
  File "pyarrow/array.pxi", line 549, in pyarrow.lib._normalize_index
IndexError: index out of bounds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/run_mlm.py", line 529, in _mp_fn
    main()
  File "/content/run_mlm.py", line 393, in main
    desc="Running tokenizer on dataset line_by_line",
  File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/dist-packages/transformers/training_args.py", line 1011, in main_process_first
    torch.distributed.barrier()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 2523, in barrier
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Traceback (most recent call last):
  File "xla_spawn.py", line 85, in <module>
    main()
  File "xla_spawn.py", line 81, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 394, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 144, in join
    exit_code=exitcode
torch.multiprocessing.spawn.ProcessExitedException: process 6 terminated with exit code 17

lhoestq commented 3 years ago

Hi! This might be because num_proc is set to a value higher than the size of the dataset (so you end up with an empty dataset in one process). This was recently solved by this PR https://github.com/huggingface/datasets/pull/2566. There will be a new release of datasets today to make this fix available. In the meantime, you can try using a bigger dataset or reducing the number of data processing workers.
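
For illustration, a minimal sketch of this failure mode, assuming the `datasets` library; the toy dataset, worker count, and the `safe_num_proc` workaround variable are made up for the example:

from datasets import Dataset

# Illustrative toy dataset: 3 rows sharded across 15 workers (both numbers made up).
tiny = Dataset.from_dict({"text": ["a", "b", "c"]})
num_proc = 15

for rank in range(num_proc):
    # With more shards than rows, the extra ranks get empty shards; building such an
    # empty shard is what raised the IndexError in `datasets` before the linked fix.
    shard = tiny.shard(num_shards=num_proc, index=rank, contiguous=True)
    print(rank, len(shard))

# Defensive workaround in the meantime: never request more workers than rows.
safe_num_proc = max(1, min(num_proc, len(tiny)))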

neel04 commented 3 years ago

Hmmm... my dataset is about 25k sequences, which I cut down to 15k to save memory :thinking: so num_proc shouldn't pose any issue. Right now, following your suggestion, I've set it to the default.

Anyway, following up on the suggestion made by @LysandreJik, it seems there might be some inconsistency in how the dataset is created: with max_seq_length set to 512 and a few other flags for gradient accumulation, it can train properly.

Could you try this out for me: set the max_seq_length value to something low, like 512 or 256. Does it still crash then?

For such lower values it definitely doesn't crash, which means you might be right. I will double-check my dataset generation process, but it still irks me that I can't see max_seq_length in the accepted TrainingArguments. Also, even if there aren't enough tokens to reach the required 16k limit, why doesn't the pad_to_max_length flag act here and pad up to the max length?

neel04 commented 3 years ago

In that case, should I crop long sequences and pad shorter sequences manually, or is this supposed to be handled automatically by the dataset-processing part of the script?

LysandreJik commented 3 years ago

it still irks me that I can't see max_seq_length in the accepted TrainingArguments.

max_seq_length isn't a TrainingArguments field, it's a DataTrainingArguments field. The difference is that the former is used by the Trainer, while the latter is only used by the script for pre/post-processing and is not passed to the Trainer.
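
For context, a rough sketch of that split (an assumed simplification of how the example script wires its arguments, with only two illustrative fields, not the script's full code):

from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser, TrainingArguments


@dataclass
class DataTrainingArguments:
    # Only two of the script's fields shown for illustration.
    max_seq_length: Optional[int] = field(default=None)
    pad_to_max_length: bool = field(default=False)


# Both dataclasses are parsed from the same command line,
# e.g. `python sketch.py --output_dir=out --max_seq_length=512`.
parser = HfArgumentParser((DataTrainingArguments, TrainingArguments))
data_args, training_args = parser.parse_args_into_dataclasses()

# data_args drives tokenization/grouping in the script; only training_args reaches the Trainer.
print(data_args.max_seq_length, training_args.output_dir)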

why doesn't the pad_to_max_length flag act here and pad up to the max length?

I'm thinking the issue happens before the pad_to_max_length flag is consumed. I can reproduce with the following:

echo "This is a random sentence" > small_file.txt
python ~/transformers/examples/pytorch/language-modeling/run_mlm.py \
  --output_dir=output_dir \
  --model_name_or_path=google/bigbird-roberta-base \
  --train_file=small_file.txt \
  --do_train

The error comes from the dataset map that calls the group_texts method. This method tries to put all of the tokenized examples in the result dictionary, but drops the small remainder. As we don't have enough data to complete a single sequence, the method returns an empty result:

{'attention_mask': [], 'input_ids': [], 'special_tokens_mask': []}
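
As a standalone illustration, here is a simplified sketch of that grouping logic (the max_seq_length value and token IDs are made up; this is not the exact script code) showing how a short input collapses to an empty result:

max_seq_length = 16  # illustrative value

def group_texts(examples):
    # Concatenate all texts, then drop the remainder that doesn't fill a full chunk.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    total_length = (total_length // max_seq_length) * max_seq_length  # 7 tokens -> 0
    return {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated.items()
    }

print(group_texts({"input_ids": [[65, 7, 8, 9, 10, 11, 66]]}))
# {'input_ids': []}  (every key ends up empty, matching the result shown above)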

@sgugger can chime in if my approach is wrong, but the following modifications to the group_texts method seem to do the trick:

    def group_texts(examples):
        # Concatenate all texts.
        concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
        # customize this part to your needs.
-       total_length = (total_length // max_seq_length) * max_seq_length
+       truncated_total_length = (total_length // max_seq_length) * max_seq_length
        # Split by chunks of max_len.
-       result = {
-           k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
-           for k, t in concatenated_examples.items()
-       }
+       if total_length == 0:
+           result = {
+               k: [t[i : i + max_seq_length] for i in range(0, truncated_total_length, max_seq_length)]
+               for k, t in concatenated_examples.items()
+           }
+       else:
+           result = {
+               k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
+               for k, t in concatenated_examples.items()
+           }
        return result

neel04 commented 3 years ago

That clears up a lot of things, @LysandreJik! Thanks a ton :rocket: :cake: :1st_place_medal:

Just a minor note: when running the script, it apparently doesn't log anything to the Colab cell output. I tried different logging levels and resetting to the defaults, to no avail (no bother; I simply piped the output to a file and used tail -f to view updates in real time).

sgugger commented 3 years ago

I don't understand your diff, @LysandreJik. If total_length == 0 then truncated_total_length is also 0. I think you meant something more like this, maybe?

- total_length = (total_length // max_seq_length) * max_seq_length
+ if total_length >= max_seq_length:
+      total_length = (total_length // max_seq_length) * max_seq_length

LysandreJik commented 3 years ago

Ah, I think I made a typo when copying the code; my local code has if truncated_total_length != 0: instead of if total_length == 0:.

This way, if the truncated total length is equal to 0 (as in this case), it will use total_length (which is 7) to create the example.

If the truncated total length is not 0, it will use that value to create the example, which was the behavior before.

Feel free to modify as you wish so that it's clearer for you!

sgugger commented 3 years ago

Yes, then it's equivalent to my suggestion. Thanks!
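
For reference, a self-contained sketch of the agreed-upon behavior (illustrative only, with a made-up max_seq_length and token IDs; not the exact code that was merged into the example script):

def group_texts(examples, max_seq_length):
    # Concatenate all texts for each key.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Only drop the remainder when there is at least one full chunk;
    # otherwise keep the short sequence instead of returning an empty result.
    if total_length >= max_seq_length:
        total_length = (total_length // max_seq_length) * max_seq_length
    return {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated.items()
    }

print(group_texts({"input_ids": [[65, 7, 8, 9, 10, 11, 66]]}, max_seq_length=16))
# {'input_ids': [[65, 7, 8, 9, 10, 11, 66]]}  (the short example is kept as one chunk)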

neel04 commented 3 years ago

@LysandreJik I may be misunderstanding how argument parsing works, but for flags like evaluation_strategy, it doesn't seem that the script parses them at all? I have a logging problem (https://discuss.huggingface.co/t/no-step-wise-logging-for-xla-mlm-scripts-in-colab-jupyter/8134) where the script seems to ignore the arguments or fails to override them. I get a loss log only at the start of an epoch (0.19), once more somewhere around epoch 1.89, and never again, when training for 5 epochs.

This seems strange, and I can't judge how my models are performing. Any ideas?