Maybe @lhoestq has an idea for the error in datasets
@lhoestq Any possible leads as to who can solve this bug?
This is the full traceback, by the way, in case it helps. I'm also willing to create a reproducible Colab if you want:
06/28/2021 17:23:13 - WARNING - run_mlm - Process rank: -1, device: xla:0, n_gpu: 0distributed training: False, 16-bits training: False
06/28/2021 17:23:13 - WARNING - datasets.builder - Using custom data configuration default-e8bc7b301aa1b353
06/28/2021 17:23:13 - WARNING - datasets.builder - Reusing dataset text (/root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
Downloading and preparing dataset text/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5...
Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5. Subsequent calls will reuse this data.
WARNING:root:TPU has started up successfully with version pytorch-1.9
WARNING:root:TPU has started up successfully with version pytorch-1.9
WARNING:run_mlm:Process rank: -1, device: xla:1, n_gpu: 0distributed training: False, 16-bits training: False
INFO:run_mlm:Training/evaluation parameters TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.98,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=10,
eval_steps=50,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
greater_is_better=True,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0003,
length_column_name=length,
load_best_model_at_end=True,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=./logs,
logging_first_step=False,
logging_steps=50,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=validation,
mp_parameters=,
no_cuda=False,
num_train_epochs=5.0,
output_dir=./results,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=results,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=./results,
save_steps=500,
save_strategy=IntervalStrategy.EPOCH,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=8,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=1000,
weight_decay=0.01,
)
WARNING:datasets.builder:Using custom data configuration default-e8bc7b301aa1b353
INFO:datasets.utils.filelock:Lock 139795201622480 acquired on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-e8bc7b301aa1b353_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
INFO:datasets.utils.filelock:Lock 139795201622480 released on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-e8bc7b301aa1b353_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
INFO:datasets.utils.filelock:Lock 139795201622864 acquired on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-e8bc7b301aa1b353_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
INFO:datasets.builder:Generating dataset text (/root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
100%|██████████| 2/2 [00:00<00:00, 2330.17it/s]
INFO:datasets.utils.download_manager:Downloading took 0.0 min
INFO:datasets.utils.download_manager:Checksum Computation took 0.0 min
100%|██████████| 2/2 [00:00<00:00, 920.91it/s]
INFO:datasets.utils.info_utils:Unable to verify checksums.
INFO:datasets.builder:Generating split train
INFO:datasets.arrow_writer:Done writing 8 examples in 172 bytes /root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete/text-train.arrow.
INFO:datasets.builder:Generating split validation
INFO:datasets.arrow_writer:Done writing 8 examples in 172 bytes /root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete/text-validation.arrow.
INFO:datasets.utils.info_utils:Unable to verify splits sizes.
INFO:datasets.utils.filelock:Lock 139795201625808 acquired on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-e8bc7b301aa1b353_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete.lock
INFO:datasets.utils.filelock:Lock 139795201625808 released on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-e8bc7b301aa1b353_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.incomplete.lock
INFO:datasets.utils.filelock:Lock 139795201622864 released on /root/.cache/huggingface/datasets/_root_.cache_huggingface_datasets_text_default-e8bc7b301aa1b353_0.0.0_e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5.lock
INFO:datasets.builder:Constructing Dataset for split train, validation, from /root/.cache/huggingface/datasets/text/default-e8bc7b301aa1b353/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5
100%|██████████| 2/2 [00:00<00:00, 458.74it/s]
[INFO|configuration_utils.py:528] 2021-06-28 17:23:13,619 >> loading configuration file ./config/config.json
[INFO|configuration_utils.py:566] 2021-06-28 17:23:13,619 >> Model config BigBirdConfig {
"architectures": [
"BigBirdForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"attention_type": "block_sparse",
"block_size": 64,
"bos_token_id": 1,
"eos_token_id": 2,
"gradient_checkpointing": false,
"hidden_act": "gelu_new",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 16000,
"model_type": "big_bird",
"num_attention_heads": 4,
"num_hidden_layers": 4,
"num_random_blocks": 3,
"pad_token_id": 0,
"rescale_embeddings": false,
"sep_token_id": 66,
"transformers_version": "4.9.0.dev0",
"type_vocab_size": 2,
"use_bias": true,
"use_cache": true,
"vocab_size": 40000
}
[INFO|tokenization_utils_base.py:1651] 2021-06-28 17:23:13,620 >> Didn't find file ./tokenizer/spiece.model. We won't load it.
[INFO|tokenization_utils_base.py:1651] 2021-06-28 17:23:13,620 >> Didn't find file ./tokenizer/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1715] 2021-06-28 17:23:13,620 >> loading file None
[INFO|tokenization_utils_base.py:1715] 2021-06-28 17:23:13,620 >> loading file ./tokenizer/tokenizer.json
[INFO|tokenization_utils_base.py:1715] 2021-06-28 17:23:13,620 >> loading file None
[INFO|tokenization_utils_base.py:1715] 2021-06-28 17:23:13,620 >> loading file ./tokenizer/special_tokens_map.json
[INFO|tokenization_utils_base.py:1715] 2021-06-28 17:23:13,620 >> loading file ./tokenizer/tokenizer_config.json
INFO:run_mlm:Training new model from scratch
Exception in device=TPU:6: Default process group has not been initialized, please make sure to call init_process_group.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
_start_fn(index, pf_cfg, fn, args)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
fn(gindex, *args)
File "/content/run_mlm.py", line 529, in _mp_fn
main()
File "/content/run_mlm.py", line 386, in main
with training_args.main_process_first(desc="dataset map tokenization"):
File "/usr/lib/python3.7/contextlib.py", line 112, in __enter__
return next(self.gen)
File "/usr/local/lib/python3.7/dist-packages/transformers/training_args.py", line 1005, in main_process_first
torch.distributed.barrier()
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 2523, in barrier
default_pg = _get_default_group()
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 1 indices in 8 bytes .
INFO:datasets.arrow_writer:Done writing 0 indices in 0 bytes .
Exception in device=TPU:0: Default process group has not been initialized, please make sure to call init_process_group.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/transformers/training_args.py", line 1006, in main_process_first
yield
File "/content/run_mlm.py", line 393, in main
desc="Running tokenizer on dataset line_by_line",
File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in map
for k, dataset in self.items()
File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in <dictcomp>
for k, dataset in self.items()
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1664, in map
for rank in range(num_proc)
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1664, in <listcomp>
for rank in range(num_proc)
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2664, in shard
writer_batch_size=writer_batch_size,
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 186, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 397, in wrapper
out = func(self, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2254, in select
return self._new_dataset_with_indices(indices_buffer=buf_writer.getvalue(), fingerprint=new_fingerprint)
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2170, in _new_dataset_with_indices
fingerprint=fingerprint,
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 297, in __init__
self._indices.column(0)[0].type
File "pyarrow/table.pxi", line 162, in pyarrow.lib.ChunkedArray.__getitem__
File "pyarrow/array.pxi", line 549, in pyarrow.lib._normalize_index
IndexError: index out of bounds
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
_start_fn(index, pf_cfg, fn, args)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
fn(gindex, *args)
File "/content/run_mlm.py", line 529, in _mp_fn
main()
File "/content/run_mlm.py", line 393, in main
desc="Running tokenizer on dataset line_by_line",
File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.7/dist-packages/transformers/training_args.py", line 1011, in main_process_first
torch.distributed.barrier()
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 2523, in barrier
default_pg = _get_default_group()
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Traceback (most recent call last):
File "xla_spawn.py", line 85, in <module>
main()
File "xla_spawn.py", line 81, in main
xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 394, in spawn
start_method=start_method)
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 144, in join
exit_code=exitcode
torch.multiprocessing.spawn.ProcessExitedException: process 6 terminated with exit code 17
Hi!
This might be because num_proc is set to a value higher than the size of the dataset (so you end up with an empty dataset in one process).
This has recently been solved by this PR: https://github.com/huggingface/datasets/pull/2566. There will be a new release of datasets today to make this fix available. In the meantime, you can try using a bigger dataset or reduce the number of data processing workers.
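In the meantime, a rough and untested sketch of that workaround - capping the number of map workers at the dataset size so no worker ends up with an empty shard (the file name and tokenize function here are just placeholders):

from datasets import load_dataset

raw_datasets = load_dataset("text", data_files={"train": "train.txt"})

requested_num_proc = 4  # whatever you pass as --preprocessing_num_workers
# Never spawn more workers than there are examples, otherwise one shard comes out empty.
safe_num_proc = max(1, min(requested_num_proc, len(raw_datasets["train"])))

def tokenize_function(examples):
    # stand-in for the real tokenization step
    return {"n_chars": [len(t) for t in examples["text"]]}

tokenized = raw_datasets.map(tokenize_function, batched=True, num_proc=safe_num_proc)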
Hmmm... my dataset is about 25k sequences, which I cut down to 15k to save memory :thinking: so num_proc shouldn't pose any issue. Right now, following your suggestion, I've set it to the default.
Anyway, following up on the suggestion made by @LysandreJik, it seems there might be some inconsistency in how the dataset is created - with max_length set to 512 and a few other flags for gradient accumulation, it seems to train properly.
Could you try this out for me: set the max_seq_length value to something low, like 512 or 256. Does it still crash then?
For such lower values it definitely doesn't crash, which means you might be right. I'll double-check my dataset generation process, but it still irks me why I can't see max_seq_length in the accepted TrainingArguments. Also, even if there aren't enough tokens to reach the required 16k limit, why doesn't the pad_to_max_length flag act in this case and pad up to the max length?
In that case, should I crop long sequences and pad shorter ones manually, or is this supposed to be done automatically by the dataset processing part of the script?
it still irks me why I can't see max_seq_length in the accepted TrainingArguments.
max_seq_length isn't a TrainingArguments, it's a DataTrainingArguments. The difference is that the former is used by the Trainer, while the latter is only used by the script to do pre/post-processing, and is not passed to the Trainer.
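Roughly, the script wires the two up like this (a trimmed-down sketch; the real DataTrainingArguments dataclass in run_mlm.py has more fields):

from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser, TrainingArguments

@dataclass
class DataTrainingArguments:
    # only the fields discussed here
    max_seq_length: Optional[int] = field(default=None)
    pad_to_max_length: bool = field(default=False)

parser = HfArgumentParser((DataTrainingArguments, TrainingArguments))
data_args, training_args = parser.parse_args_into_dataclasses()

# training_args is handed to the Trainer; data_args is only read by the
# tokenization/grouping code inside the script itself.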
why doesn't the pad_to_max_length flag act in this case and pad up to the max length?
I'm thinking the issue happens earlier, before the pad_to_max_length flag is consumed. I can reproduce with the following:
echo "This is a random sentence" > small_file.txt
python ~/transformers/examples/pytorch/language-modeling/run_mlm.py \
--output_dir=output_dir \
--model_name_or_path=google/bigbird-roberta-base \
--train_file=small_file.txt \
--do_train
The error comes from the dataset map that calls the group_texts method. This method tries to put all of the tokenized examples in the result dictionary, but drops the small remainder. Since we don't have enough data to complete a single sequence, the method returns an empty result:
{'attention_mask': [], 'input_ids': [], 'special_tokens_mask': []}
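For a standalone illustration with numbers close to this case (7 tokens in total against a 16k max_seq_length):

max_seq_length = 16000
concatenated_examples = {"input_ids": list(range(7))}  # only 7 tokens in total

total_length = len(concatenated_examples["input_ids"])
# Dropping the remainder leaves nothing when total_length < max_seq_length.
total_length = (total_length // max_seq_length) * max_seq_length  # -> 0

result = {
    k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
    for k, t in concatenated_examples.items()
}
print(result)  # {'input_ids': []} -> an empty split, which later trips the Arrow index error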
@sgugger can chime in if my approach is wrong, but the following modifications to the group_texts method seem to do the trick:
def group_texts(examples):
# Concatenate all texts.
concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
total_length = len(concatenated_examples[list(examples.keys())[0]])
# We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
# customize this part to your needs.
- total_length = (total_length // max_seq_length) * max_seq_length
+ truncated_total_length = (total_length // max_seq_length) * max_seq_length
# Split by chunks of max_len.
- result = {
- k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
- for k, t in concatenated_examples.items()
- }
+ if total_length == 0:
+ result = {
+ k: [t[i : i + max_seq_length] for i in range(0, truncated_total_length, max_seq_length)]
+ for k, t in concatenated_examples.items()
+ }
+ else:
+ result = {
+ k: [t[i: i + max_seq_length] for i in range(0, total_length, max_seq_length)]
+ for k, t in concatenated_examples.items()
+ }
return result
That clears up a lot of things @LysandreJik! Thanks a ton :rocket: :cake: :1st_place_medal:
Just a minor note: when running the scripts, nothing gets logged to the Colab cell output. I tried different logging levels and resetting them to the defaults, to no avail (no bother - I simply piped the output to a file and used tail -f to watch it update in real time).
I don't understand your diff @LysandreJik. If total_length == 0 then truncated_total_length is also 0. I think you meant something more like this, maybe?
- total_length = (total_length // max_seq_length) * max_seq_length
+ if total_length >= max_seq_length:
+     total_length = (total_length // max_seq_length) * max_seq_length
Ah, I think I made a typo when copying the code; my local code has if truncated_total_length != 0: instead of if total_length == 0:.
This way, if the truncated total length is equal to 0 (like in this case), it will use total_length (which is 7) to create the example. If the truncated total length is not 0, it will use that value to create the example, which was the case before.
Feel free to modify it as you wish so that it's clearer for you!
Yes, then it's equivalent to my suggestion. Thanks!
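For reference, a consolidated standalone version of the corrected logic, equivalent to both suggestions above (a sketch, not the exact code that ships with the example script; max_seq_length is taken as a parameter here instead of a closure variable):

def group_texts(examples, max_seq_length=512):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Only drop the small remainder once there is at least one full chunk;
    # otherwise keep the short sequence instead of returning an empty result.
    if total_length >= max_seq_length:
        total_length = (total_length // max_seq_length) * max_seq_length
    # Split by chunks of max_seq_length.
    return {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated_examples.items()
    }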
@LysandreJik I may be misunderstanding how argument parsing works, but for flags like evaluation_strategy, it doesn't seem that the script parses them at all? I have a logging problem (https://discuss.huggingface.co/t/no-step-wise-logging-for-xla-mlm-scripts-in-colab-jupyter/8134) where the script seems to ignore the arguments or fails to override them. I get a loss log only once near the start (epoch 0.19), once more somewhere around epoch 1.89, and never again, even when training for 5 epochs.
This seems strange, and I can't judge how my models are performing. Any ideas?
Environment info
transformers version: 4.9.0.dev0
TPU (8 cores)
Who can help
Not sure who might be the most appropriate person.
Information
Model I am using (Bert, XLNet ...): BigBird (MLM)
The problem arises when using:
The tasks I am working on is:
To reproduce
This is an error with the MLM script (PyTorch), hit while attempting to pre-train BigBird on TPUs over XLA. The dataset in question is a custom dataset, and the model config and tokenizer have been initialized appropriately. This is a continuation of this unanswered Forum post that faces the same error.
Command used to run the script:-
I am facing two errors, to be precise.
I haven't modified the script to call init_process_group yet, focusing on the earlier index-out-of-bounds error. Clearly, the problem arises from my own dataset - which was working before, however. Interestingly, it shows up at the tokenizing stage: at some point, while constructing the Arrow dataset, it fails. I have no idea about Apache Arrow, so I can't debug further :sweat_smile:
As for the dataset, a few simple lines of code with random numbers would be more than enough to reproduce it.
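For instance, something along these lines would do (a hypothetical sketch, not the actual generation code; the sequence length and file name are arbitrary, and the vocab size matches the config above):

import random

random.seed(42)
with open("train.txt", "w") as f:
    for _ in range(15_000):  # roughly the dataset size mentioned above
        line = " ".join(str(random.randint(0, 39_999)) for _ in range(64))
        f.write(line + "\n")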
Can anyone give me some guidance on where I should start investigating the error, and some possible leads as to its origin? Any ideas on how I can solve it?