huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

TF : tensor mismatch error in training with opus100 and t5-small #24693

Closed SoyGema closed 1 year ago

SoyGema commented 1 year ago

System Info

transformers==4.31.0.dev0, tensorflow-macos==2.10.0

Hello there! 👋 Thanks for creating examples for the Translation task!

Context

I'm going through the run_translation.py example, modified to use the opus100 dataset, and launching the script with the flags listed below.

python train_model.py \
    --model_name_or_path t5-small \
    --do_train \
    --do_eval \
    --source_lang en \
    --target_lang ro \
    --source_prefix "translate English to Romanian: " \
    --dataset_name opus100 \
    --dataset_config_name en-ro \
    --output_dir /tmp/tst-translation \
    --per_device_train_batch_size=16 \
    --per_device_eval_batch_size=16 \
    --overwrite_output_dir

Error

All the dataset feature engineering seems to run fine and training starts, but at some point there is a tensor mismatch error during training.

Shape of tensor args_0 [16,128] is not compatible with expected shape [16,64].
         [[{{node EnsureShape_1}}]]
         [[MultiDeviceIteratorGetNextFromShard]]
         [[RemoteCall]]
         [[IteratorGetNext]] [Op:__inference_train_function_17297]

Any hints on how I should reshape this? At first I thought it was something in the preprocessing, but training does start, so I'm a little confused... I also explored wmt16 (the example that is tested and working) during #24579, and when I go to the Hub it seems to have the same structure and partitions as opus100.

Thanks for the time dedicated to this! 🙂 And for the help! Looking forward to getting all this working and sharing it in the PyCon Spain keynote this year!

Who can help?

@gante

Information

Tasks

Reproduction

  1. Launch training with the following config
    python train_model.py \
    --model_name_or_path t5-small \
    --do_train \
    --do_eval \
    --source_lang en \
    --target_lang ro \
    --source_prefix "translate English to Romanian: " \
    --dataset_name opus100 \
    --dataset_config_name en-ro \
    --output_dir /tmp/tst-translation \
    --per_device_train_batch_size=16 \
    --per_device_eval_batch_size=16 \
    --overwrite_output_dir

Expected behavior

Training runs without being interrupted.

ydshieh commented 1 year ago

This looks like a dataset issue, which is outside the scope of the transformers GitHub repository.

However, if you can provide the full error log + the content of train_model.py, we might be able to take a quick look.

SoyGema commented 1 year ago

Hello there @ydshieh. Thanks for your time 🙏🙏 You can find the full script here.

Full Log

07/06/2023 17:59:34 - INFO - __main__ - Training/evaluation parameters TFTrainingArguments(
_n_gpu=-1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gcp_project=None,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/tmp/tst-translation/runs/Jul06_17-59-34_mbp-de-gema.lan,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_hf,
optim_args=None,
output_dir=/tmp/tst-translation,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=16,
per_device_train_batch_size=16,
poly_power=1.0,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['mlflow', 'tensorboard'],
resume_from_checkpoint=None,
run_name=/tmp/tst-translation,
save_on_each_node=False,
save_safetensors=False,
save_steps=500,
save_strategy=steps,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_name=None,
tpu_num_cores=None,
tpu_zone=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xla=False,
xpu_backend=None,
)
07/06/2023 17:59:35 - INFO - datasets.info - Loading Dataset Infos from /Users/gema/.cache/huggingface/modules/datasets_modules/datasets/opus100/256f3196b69901fb0c79810ef468e2c4ed84fbd563719920b1ff1fdc750f7704
07/06/2023 17:59:35 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists.
07/06/2023 17:59:35 - INFO - datasets.info - Loading Dataset info from /Users/gema/.cache/huggingface/datasets/opus100/en-ro/0.0.0/256f3196b69901fb0c79810ef468e2c4ed84fbd563719920b1ff1fdc750f7704
07/06/2023 17:59:35 - WARNING - datasets.builder - Found cached dataset opus100 (/Users/gema/.cache/huggingface/datasets/opus100/en-ro/0.0.0/256f3196b69901fb0c79810ef468e2c4ed84fbd563719920b1ff1fdc750f7704)
07/06/2023 17:59:35 - INFO - datasets.info - Loading Dataset info from /Users/gema/.cache/huggingface/datasets/opus100/en-ro/0.0.0/256f3196b69901fb0c79810ef468e2c4ed84fbd563719920b1ff1fdc750f7704
100%|██████████| 3/3 [00:00<00:00, 33.24it/s]
loading configuration file t5-small/config.json
Model config T5Config {
  "_name_or_path": "t5-small",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to German: "
    },
    "translation_en_to_fr": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to French: "
    },
    "translation_en_to_pt": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to Portuguese: "
    },
    "translation_en_to_ro": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to Romanian: "
    }
  },
  "transformers_version": "4.31.0.dev0",
  "use_cache": true,
  "vocab_size": 32128
}

loading file spiece.model
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
07/06/2023 17:59:36 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /Users/gema/.cache/huggingface/datasets/opus100/en-ro/0.0.0/256f3196b69901fb0c79810ef468e2c4ed84fbd563719920b1ff1fdc750f7704/cache-107d5d31727344a2.arrow
Running tokenizer on validation dataset:   0%|                                                                          | 0/2000 [00:00<?, ? examples/s]07/06/2023 17:59:36 - INFO - datasets.arrow_dataset - Caching processed dataset at /Users/gema/.cache/huggingface/datasets/opus100/en-ro/0.0.0/256f3196b69901fb0c79810ef468e2c4ed84fbd563719920b1ff1fdc750f7704/cache-e8cb6f4c7ff7ad3e.arrow
Tensorflow: setting up strategy                                                                                                                         
loading weights file t5-small/model.safetensors
Generate config GenerationConfig {
  "_from_model_config": true,
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0,
  "transformers_version": "4.31.0.dev0"
}

Loaded 60,506,624 parameters in the TF 2.0 model.
All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss. You can also specify `loss='auto'` to get the internal loss without printing this info string.
07/06/2023 17:59:38 - INFO - __main__ - ***** Running training *****
07/06/2023 17:59:38 - INFO - __main__ -   Num examples = 1000000
07/06/2023 17:59:38 - INFO - __main__ -   Num Epochs = 3.0
07/06/2023 17:59:38 - INFO - __main__ -   Instantaneous batch size per device = 16
07/06/2023 17:59:38 - INFO - __main__ -   Total train batch size = 16
07/06/2023 17:59:38 - INFO - __main__ -   Total optimization steps = 187500
2023-07-06 17:59:38.328410: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2023-07-06 17:59:38.353957: W tensorflow/core/framework/dataset.cc:769] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Epoch 1/3
   18/62500 [..............................] - ETA: 21:26:35 - loss: 2.2246Traceback (most recent call last):
  File "/Users/gema/Documents/The-Lord-of-The-Words-The-two-frameworks/src/models/train_model.py", line 730, in <module>
    main()
  File "/Users/gema/Documents/The-Lord-of-The-Words-The-two-frameworks/src/models/train_model.py", line 683, in main
    history = model.fit(tf_train_dataset, epochs=int(training_args.num_train_epochs), callbacks=callbacks)
  File "/Users/gema/miniforge3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Users/gema/miniforge3/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Shape of tensor args_0 [16,128] is not compatible with expected shape [16,64].
         [[{{node EnsureShape_1}}]]
         [[MultiDeviceIteratorGetNextFromShard]]
         [[RemoteCall]]
         [[IteratorGetNext]] [Op:__inference_train_function_17297]

For the future, I will go with a tailored example for the forum in case I should be redirected there. Let me know if at some point this turns out to be a suitable issue for datasets instead. 🧭🗺️ Thanks for the time dedicated to this, I really appreciate it, and my apologies for the inconvenience.

ydshieh commented 1 year ago

@Rocketknight1

Do you know why

        if "cols_to_retain" in list(inspect.signature(dataset._get_output_signature).parameters.keys()):
            output_signature, _ = dataset._get_output_signature(
                dataset,
                batch_size=None,
                collate_fn=collate_fn,
                collate_fn_args=collate_fn_args,
                cols_to_retain=model_inputs,
            )

gives output_signature

{'input_ids': TensorSpec(shape=(None, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(None, None), dtype=tf.int64, name=None), 'labels': TensorSpec(shape=(None, 64), dtype=tf.int64, name=None), 'decoder_input_ids': TensorSpec(shape=(None, 64), dtype=tf.int64, name=None)}

which has a fixed sequence length of 64 in labels and decoder_input_ids?

FYI: the sequences in the dataset have different lengths in each element.
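
For reference, here is a minimal sketch of how one could reproduce that observation. The names (model, tokenizer, tokenized_datasets) are assumed to be the objects created by the example script, so treat this as illustrative rather than the exact code path:

    from transformers import DataCollatorForSeq2Seq

    # Build the tf.data pipeline the same way the example script does, then
    # print the element spec that was inferred for it.
    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="np")
    tf_train_dataset = model.prepare_tf_dataset(
        tokenized_datasets["train"],
        batch_size=16,
        shuffle=True,
        collate_fn=data_collator,
    )
    print(tf_train_dataset.element_spec)
    # If labels/decoder_input_ids show a fixed shape such as (None, 64), any batch
    # that happens to pad to a different length (e.g. 128) will fail the
    # EnsureShape check at training time.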

Rocketknight1 commented 1 year ago

@ydshieh We actually generate those shapes empirically by grabbing several batches from the dataset, which is not ideal but usually works. Do almost all samples from the dataset have a post-padding decoder_input_ids length of 64, but some don't? That might trigger this issue. If that turns out to be the case, let me know - I've been wary of that code for a while, so this might be a good time to try a fix!
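
A quick way to check that hypothesis is sketched below, under the assumption that tokenizer, model and the tokenized train split from the example script are available (the names here are illustrative):

    import random

    from transformers import DataCollatorForSeq2Seq

    collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="np")
    train_split = tokenized_datasets["train"]

    # Collate a handful of random batches and record the padded label length of each.
    padded_lengths = []
    for _ in range(20):
        indices = random.sample(range(len(train_split)), 16)
        batch = collator([train_split[i] for i in indices])
        padded_lengths.append(batch["labels"].shape[1])

    print(padded_lengths)
    # Mostly 64 with an occasional larger value (e.g. 128) would confirm that the
    # empirically inferred signature cannot cover every batch.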

SoyGema commented 1 year ago

Hello there. Thanks again for keeping this issue open. 🙏 I managed to solve the issue. I'm putting it here before closing. Hopefully this can shed some light on the question posted above.

1. Script train_model.py

What I understand is that in the preprocess_function we call the tokenizer, which has the padding and the max length associated with it.

1.a) Initially, I set max_source_length, which fixes the length after tokenization to 64. According to the docstring, longer sequences are truncated and shorter ones are padded. IT TRAINS CORRECTLY. But then I thought that this could (please correct me if I'm wrong) cut off longer sequences, so longer sentences would be truncated, which could hurt the model's ability to use context when translating long sentences.

1.b) Then I discovered pad_to_max_length. What I'm assuming here is that it pads taking the maximum sequence length into account, so I tried setting it to True and max_target_length to None. IT SEEMS TO BE TRAINING CORRECTLY as well. My understanding is that here I'm padding with respect to the max length. Both options are sketched below.
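
A reduced sketch of the two settings described above. It follows the shape of the preprocess_function in the example script, but the names and hard-coded values are illustrative rather than the exact code in train_model.py:

    prefix = "translate English to Romanian: "

    def preprocess_function(examples):
        inputs = [prefix + ex["en"] for ex in examples["translation"]]
        targets = [ex["ro"] for ex in examples["translation"]]

        # Option 1.a: fix the length after tokenization, truncating longer sequences
        # and padding shorter ones so every example ends up with length 64.
        model_inputs = tokenizer(inputs, max_length=64, padding="max_length", truncation=True)
        labels = tokenizer(text_target=targets, max_length=64, padding="max_length", truncation=True)

        # Option 1.b: pass pad_to_max_length=True to the script instead (which maps to
        # padding="max_length") and leave max_target_length as None, so everything is
        # padded up to the tokenizer's model max length rather than to 64.

        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

Either way, every batch ends up with one consistent padded shape, which is what avoids the EnsureShape mismatch from the original error.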

Come what may, I manage to TRAIN the model with these two options. If anyone wants to continue this conversation or correct any wrong hypothesis I might have, please come by #2 🙂, as I don't think it's proper to keep it in this issue. 💗🤗

Thanks @ydshieh & @Rocketknight1