huggingface / transformers

šŸ¤— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

IndexError: index out of bound, MLM+XLA (pre-training) #12438

Closed · neel04 closed this issue 3 years ago

neel04 commented 3 years ago

Environment info

Who can help

Not sure who might be the most appropriate person

Information

Model I am using (Bert, XLNet ...): BigBird (MLM)

The problem arises when using:

The tasks I am working on are:

To reproduce

This error occurs with the MLM script (PyTorch) when attempting to pre-train BigBird on TPUs via XLA. The dataset in question is a custom one, and the model config and tokenizer have been initialized appropriately.

This is a continuation of this unanswered forum post, which hits the same error.

Command used to run the script:

%%bash
python xla_spawn.py --num_cores=8 ./run_mlm.py --output_dir="./results" \
    --model_type="big_bird" \
    --config_name="./config" \
    --tokenizer_name="./tokenizer" \
    --train_file="./dataset.txt" \
    --validation_file="./val.txt" \
    --line_by_line="True" \
    --max_seq_length="16000" \
    --weight_decay="0.01" \
    --per_device_train_batch_size="1" \
    --per_device_eval_batch_size="1" \
    --learning_rate="3e-4" \
    --tpu_num_cores='8' \
    --warmup_steps="1000" \
    --overwrite_output_dir \
    --pad_to_max_length \
    --num_train_epochs="5" \
    --adam_beta1="0.9" \
    --adam_beta2="0.98" \
    --do_train \
    --do_eval \
    --logging_steps="50" \
    --evaluation_strategy="steps" \
    --eval_accumulation_steps='10' \
    --report_to="tensorboard" \
    --logging_dir='./logs' \
    --save_strategy="epoch" \
    --load_best_model_at_end='True' \
    --metric_for_best_model='validation' \
    --preprocessing_num_workers='15'

I am facing two errors, to be precise:

Exception in device=TPU:0: Default process group has not been initialized, please make sure to call init_process_group.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/transformers/training_args.py", line 1006, in main_process_first
    yield
  File "/content/run_mlm.py", line 393, in main
    desc="Running tokenizer on dataset line_by_line",
  File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in map
    for k, dataset in self.items()
  File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in <dictcomp>
    for k, dataset in self.items()
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1664, in map
    for rank in range(num_proc)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1664, in <listcomp>
    for rank in range(num_proc)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2664, in shard
    writer_batch_size=writer_batch_size,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 186, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 397, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2254, in select
    return self._new_dataset_with_indices(indices_buffer=buf_writer.getvalue(), fingerprint=new_fingerprint)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2170, in _new_dataset_with_indices
    fingerprint=fingerprint,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 297, in __init__
    self._indices.column(0)[0].type
  File "pyarrow/table.pxi", line 162, in pyarrow.lib.ChunkedArray.__getitem__
  File "pyarrow/array.pxi", line 549, in pyarrow.lib._normalize_index
IndexError: index out of bounds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/run_mlm.py", line 529, in _mp_fn
    main()
  File "/content/run_mlm.py", line 393, in main
    desc="Running tokenizer on dataset line_by_line",
  File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/dist-packages/transformers/training_args.py", line 1011, in main_process_first
    torch.distributed.barrier()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 2523, in barrier
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

I haven't modified the script to call init_process_group yet; I'm focusing on the earlier index-out-of-bounds error. Clearly the problem arises from my own dataset, which was working before, however. Interestingly, it occurs during the tokenizing stage.

At some point while constructing the Arrow dataset, it fails. I have no idea about Apache Arrow, so I can't debug further :sweat_smile:

As for the dataset, a few simple lines of code with random numbers are more than enough to reproduce it.

import random

# Write 50 lines, each containing 16000 space-separated random numbers in [0, 40000].
with open('./dataset.txt', 'w') as f:
    for _ in range(50):
        f.write(' '.join(str(random.randint(0, 40000)) for _ in range(16000)) + '\n')

Can anyone give me some guidance on where I should start investigating the error, and some possible leads as to its origin? Any ideas on how I can solve it?

LysandreJik commented 3 years ago

Maybe @lhoestq has an idea for the error in datasets

neel04 commented 3 years ago

@lhoestq Any possible leads as to who can solve this bug?

neel04 commented 3 years ago

This is the full traceback, by the way, in case it helps move things along. I am also willing to create a reproducible Colab if you want:

06/28/2021 17:23:13 - WARNING - run_mlm - Process rank: -1, device: xla:0, n_gpu: 0distributed training: False, 16-bits training: False
06/28/2021 17:23:13 - WARNING - datasets.builder - Using custom data configuration default-e8bc7b301aa1b353
06/28/2021 17:23:13 - WARNING - datasets.builder - Reusing dataset text (/root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
Downloading and preparing dataset text/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5...
Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5. Subsequent calls will reuse this data.

WARNING:root:TPU has started up successfully with version pytorch-1.9
WARNING:root:TPU has started up successfully with version pytorch-1.9
WARNING:run_mlm:Process rank: -1, device: xla:1, n_gpu: 0distributed training: False, 16-bits training: False
INFO:run_mlm:Training/evaluation parameters TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.98,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=10,
eval_steps=50,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
greater_is_better=True,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0003,
length_column_name=length,
load_best_model_at_end=True,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=./logs,
logging_first_step=False,
logging_steps=50,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=validation,
mp_parameters=,
no_cuda=False,
num_train_epochs=5.0,
output_dir=./results,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=results,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=./results,
save_steps=500,
save_strategy=IntervalStrategy.EPOCH,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=8,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=1000,
weight_decay=0.01,
)
WARNING:datasets.builder:Using custom data configuration default-e8bc7b301aa1b353
INFO:datasets.utils.filelock:Lock 139795201622480 acquired on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-e8bc7b301aa1b353_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
INFO:datasets.utils.filelock:Lock 139795201622480 released on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-e8bc7b301aa1b353_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
INFO:datasets.utils.filelock:Lock 139795201622864 acquired on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-e8bc7b301aa1b353_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
INFO:datasets.builder:Generating dataset text (/root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 2/2 [00:00<00:00, 2330.17it/s]
INFO:datasets.utils.download_manager:Downloading took 0.0 min
INFO:datasets.utils.download_manager:Checksum Computation took 0.0 min
100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 2/2 [00:00<00:00, 920.91it/s]
INFO:datasets.utils.info_utils:Unable to verify checksums.
INFO:datasets.builder:Generating split train
INFO:datasets.arrow_writer:Done writing 8 examples in 172 bytes /root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete/text-train.arrow.
INFO:datasets.builder:Generating split validation
INFO:datasets.arrow_writer:Done writing 8 examples in 172 bytes /root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete/text-validation.arrow.
INFO:datasets.utils.info_utils:Unable to verify splits sizes.
INFO:datasets.utils.filelock:Lock 139795201625808 acquired on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-e8bc7b301aa1b353_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete.lock
INFO:datasets.utils.filelock:Lock 139795201625808 released on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-e8bc7b301aa1b353_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete.lock
INFO:datasets.utils.filelock:Lock 139795201622864 released on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-e8bc7b301aa1b353_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
INFO:datasets.builder:Constructing Dataset for split train, validation, from /root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5
100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 2/2 [00:00<00:00, 458.74it/s]
[INFO|configuration_utils.py:528] 2021-06-28 17:23:13,619 >> loading configuration file ./config/config.json
[INFO|configuration_utils.py:566] 2021-06-28 17:23:13,619 >> Model config BigBirdConfig {
  "architectures": [
    "BigBirdForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "attention_type": "block_sparse",
  "block_size": 64,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 16000,
  "model_type": "big_bird",
  "num_attention_heads": 4,
  "num_hidden_layers": 4,
  "num_random_blocks": 3,
  "pad_token_id": 0,
  "rescale_embeddings": false,
  "sep_token_id": 66,
  "transformers_version": "4.9.0.dev0",
  "type_vocab_size": 2,
  "use_bias": true,
  "use_cache": true,
  "vocab_size": 40000
}

[INFO|tokenization_utils_base.py:1651] 2021-06-28 17:23:13,620 >> Didn't find file ./tokenizer/spiece.model. We won't load it.
[INFO|tokenization_utils_base.py:1651] 2021-06-28 17:23:13,620 >> Didn't find file ./tokenizer/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1715] 2021-06-28 17:23:13,620 >> loading file None
[INFO|tokenization_utils_base.py:1715] 2021-06-28 17:23:13,620 >> loading file ./tokenizer/tokenizer.json
[INFO|tokenization_utils_base.py:1715] 2021-06-28 17:23:13,620 >> loading file None
[INFO|tokenization_utils_base.py:1715] 2021-06-28 17:23:13,620 >> loading file ./tokenizer/special_tokens_map.json
[INFO|tokenization_utils_base.py:1715] 2021-06-28 17:23:13,620 >> loading file ./tokenizer/tokenizer_config.json
INFO:run_mlm:Training new model from scratch
Exception in device=TPU:6: Default process group has not been initialized, please make sure to call init_process_group.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/run_mlm.py", line 529, in _mp_fn
    main()
  File "/content/run_mlm.py", line 386, in main
    with training_args.main_process_first(desc="dataset map tokenization"):
  File "/usr/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.7/dist-packages/transformers/training_args.py", line 1005, in main_process_first
    torch.distributed.barrier()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 2523, in barrier
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 0 indices in 0 bytes .
Exception in device=TPU:0: Default process group has not been initialized, please make sure to call init_process_group.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/transformers/training_args.py", line 1006, in main_process_first
    yield
  File "/content/run_mlm.py", line 393, in main
    desc="Running tokenizer on dataset line_by_line",
  File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in map
    for k, dataset in self.items()
  File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in <dictcomp>
    for k, dataset in self.items()
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1664, in map
    for rank in range(num_proc)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1664, in <listcomp>
    for rank in range(num_proc)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2664, in shard
    writer_batch_size=writer_batch_size,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 186, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 397, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2254, in select
    return self._new_dataset_with_indices(indices_buffer=buf_writer.getvalue(), fingerprint=new_fingerprint)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2170, in _new_dataset_with_indices
    fingerprint=fingerprint,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 297, in __init__
    self._indices.column(0)[0].type
  File "pyarrow/table.pxi", line 162, in pyarrow.lib.ChunkedArray.__getitem__
  File "pyarrow/array.pxi", line 549, in pyarrow.lib._normalize_index
IndexError: index out of bounds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/run_mlm.py", line 529, in _mp_fn
    main()
  File "/content/run_mlm.py", line 393, in main
    desc="Running tokenizer on dataset line_by_line",
  File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/dist-packages/transformers/training_args.py", line 1011, in main_process_first
    torch.distributed.barrier()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 2523, in barrier
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Traceback (most recent call last):
  File "xla_spawn.py", line 85, in <module>
    main()
  File "xla_spawn.py", line 81, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 394, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 144, in join
    exit_code=exitcode
torch.multiprocessing.spawn.ProcessExitedException: process 6 terminated with exit code 17

lhoestq commented 3 years ago

Hi! This might be because num_proc is set to a value higher than the size of the dataset (so you end up with an empty dataset in one process). This was recently solved by this PR https://github.com/huggingface/datasets/pull/2566. There will be a new release of datasets today to make this fix available. In the meantime, you can try using a bigger dataset or reducing the number of data processing workers.
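
For illustration, a minimal sketch of this failure mode, assuming the `datasets` library; the toy dataset, worker count, and the `safe_num_proc` workaround variable are made up for the example:

from datasets import Dataset

# Illustrative toy dataset: 3 rows sharded across 15 workers (both numbers made up).
tiny = Dataset.from_dict({"text": ["a", "b", "c"]})
num_proc = 15

for rank in range(num_proc):
    # With more shards than rows, the extra ranks get empty shards; building such an
    # empty shard is what raised the IndexError in `datasets` before the linked fix.
    shard = tiny.shard(num_shards=num_proc, index=rank, contiguous=True)
    print(rank, len(shard))

# Defensive workaround in the meantime: never request more workers than rows.
safe_num_proc = max(1, min(num_proc, len(tiny)))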

neel04 commented 3 years ago

Hmmm... my dataset is about 25k sequences, which I cut down to 15k to save memory :thinking: so num_proc shouldn't pose any issue. Right now, following your suggestion, I've set it to the default.

Anyway, following up on the suggestion made by @LysandreJik, it seems there might be some inconsistency in how the dataset is created: with max_seq_length set to 512 and a few other flags for gradient accumulation, it can train properly.

Could you try this out for me: set the max_seq_length value to something low, like 512 or 256. Does it still crash then?

For such lower values it definitely doesn't crash, which means you might be right. I will double-check my dataset generation process, but it still irks me that I can't see max_seq_length in the accepted TrainingArguments. Also, even if there aren't enough tokens to reach the required 16k limit, why doesn't the pad_to_max_length flag act here and pad up to the max length?

neel04 commented 3 years ago

In that case, should I crop long sequences and pad shorter sequences manually, or is this supposed to be handled automatically by the dataset-processing part of the script?

LysandreJik commented 3 years ago

it still irks me that I can't see max_seq_length in the accepted TrainingArguments.

max_seq_length isn't a TrainingArguments field, it's a DataTrainingArguments field. The difference is that the former is used by the Trainer, while the latter is only used by the script for pre/post-processing and is not passed to the Trainer.
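
For context, a rough sketch of that split (an assumed simplification of how the example script wires its arguments, with only two illustrative fields, not the script's full code):

from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser, TrainingArguments


@dataclass
class DataTrainingArguments:
    # Only two of the script's fields shown for illustration.
    max_seq_length: Optional[int] = field(default=None)
    pad_to_max_length: bool = field(default=False)


# Both dataclasses are parsed from the same command line,
# e.g. `python sketch.py --output_dir=out --max_seq_length=512`.
parser = HfArgumentParser((DataTrainingArguments, TrainingArguments))
data_args, training_args = parser.parse_args_into_dataclasses()

# data_args drives tokenization/grouping in the script; only training_args reaches the Trainer.
print(data_args.max_seq_length, training_args.output_dir)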

why doesn't the pad_to_max_length flag act here and pad up to the max length?

I'm thinking the issue happens before the pad_to_max_length flag is consumed. I can reproduce with the following:

echo "This is a random sentence" > small_file.txt
python ~/transformers/examples/pytorch/language-modeling/run_mlm.py \
  --output_dir=output_dir \
  --model_name_or_path=google/bigbird-roberta-base \
  --train_file=small_file.txt \
  --do_train

The error comes from the dataset map that calls the group_texts method. This method tries to put all of the tokenized examples in the result dictionary, but drops the small remainder. As we don't have enough data to complete a single sequence, the method returns an empty result:

{'attention_mask': [], 'input_ids': [], 'special_tokens_mask': []}
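
As a standalone illustration, here is a simplified sketch of that grouping logic (the max_seq_length value and token IDs are made up; this is not the exact script code) showing how a short input collapses to an empty result:

max_seq_length = 16  # illustrative value

def group_texts(examples):
    # Concatenate all texts, then drop the remainder that doesn't fill a full chunk.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    total_length = (total_length // max_seq_length) * max_seq_length  # 7 tokens -> 0
    return {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated.items()
    }

print(group_texts({"input_ids": [[65, 7, 8, 9, 10, 11, 66]]}))
# {'input_ids': []}  (every key ends up empty, matching the result shown above)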

@sgugger can chime in if my approach is wrong, but the following modifications to the group_texts method seem to do the trick:

    def group_texts(examples):
        # Concatenate all texts.
        concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
        # customize this part to your needs.
-       total_length = (total_length // max_seq_length) * max_seq_length
+       truncated_total_length = (total_length // max_seq_length) * max_seq_length
        # Split by chunks of max_len.
-       result = {
-           k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
-           for k, t in concatenated_examples.items()
-       }
+       if total_length == 0:
+           result = {
+               k: [t[i : i + max_seq_length] for i in range(0, truncated_total_length, max_seq_length)]
+               for k, t in concatenated_examples.items()
+           }
+       else:
+           result = {
+               k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
+               for k, t in concatenated_examples.items()
+           }
        return result

neel04 commented 3 years ago

That clears up a lot of things, @LysandreJik! Thanks a ton :rocket: :cake: :1st_place_medal:

Just a minor note: when running the script, it apparently doesn't log anything to the Colab cell output. I tried different logging levels and resetting to the defaults, to no avail (no bother; I simply piped the output to a file and used tail -f to view updates in real time).

sgugger commented 3 years ago

I don't understand your diff, @LysandreJik. If total_length == 0 then truncated_total_length is also 0. I think you meant something more like this, maybe?

- total_length = (total_length // max_seq_length) * max_seq_length
+ if total_length >= max_seq_length:
+      total_length = (total_length // max_seq_length) * max_seq_length

LysandreJik commented 3 years ago

Ah, I think I made a typo when copying the code; my local code has if truncated_total_length != 0: instead of if total_length == 0:.

This way, if the truncated total length is equal to 0 (as in this case), it will use total_length (which is 7) to create the example.

If the truncated total length is not 0, it will use that value to create the example, which was the behavior before.

Feel free to modify as you wish so that it's clearer for you!

sgugger commented 3 years ago

Yes, then it's equivalent to my suggestion. Thanks!
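
For reference, a self-contained sketch of the agreed-upon behavior (illustrative only, with a made-up max_seq_length and token IDs; not the exact code that was merged into the example script):

def group_texts(examples, max_seq_length):
    # Concatenate all texts for each key.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Only drop the remainder when there is at least one full chunk;
    # otherwise keep the short sequence instead of returning an empty result.
    if total_length >= max_seq_length:
        total_length = (total_length // max_seq_length) * max_seq_length
    return {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated.items()
    }

print(group_texts({"input_ids": [[65, 7, 8, 9, 10, 11, 66]]}, max_seq_length=16))
# {'input_ids': [[65, 7, 8, 9, 10, 11, 66]]}  (the short example is kept as one chunk)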

neel04 commented 3 years ago

@LysandreJik I may be misunderstanding how argument parsing works, but for flags like evaluation_strategy, it doesn't seem that the script parses them at all? I have a logging problem (https://discuss.huggingface.co/t/no-step-wise-logging-for-xla-mlm-scripts-in-colab-jupyter/8134) where the script seems to ignore the arguments or fails to override them. I get a loss log only at the start of an epoch (0.19), once more somewhere around epoch 1.89, and never again, when training for 5 epochs.

This seems strange, and I can't judge how my models are performing. Any ideas?