huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

EncoderDecoderModel with XLM-R #30211

Open Bachstelze opened 3 months ago

Bachstelze commented 3 months ago

System Info

private setup:

Google Colab setup:

Who can help?

@ArthurZucker @patrickvonplaten

Information

Tasks

Reproduction

These are the training steps for an instruction-tuned shared EncoderDecoder XLM-R model:
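
(The training script itself is not reproduced here; below is a minimal sketch, not the reporter's actual code, of how a shared EncoderDecoderModel built from `xlm-roberta-base` is typically set up with the Trainer. The dummy dataset, output path, and hyperparameters are placeholders.)

```python
# Minimal sketch: shared EncoderDecoderModel from xlm-roberta-base trained on a
# tiny dummy dataset. All data and hyperparameters below are placeholders.
from transformers import (
    AutoTokenizer,
    EncoderDecoderModel,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# tie_encoder_decoder=True shares the weights between encoder and decoder,
# which also produces the "encoder weights were not tied" notice for the pooler.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    model_name, model_name, tie_encoder_decoder=True
)
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id

# Dummy instruction/response pairs standing in for the filtered CSV dataset.
sources = ["Write a tweet for a new transformer model", "Summarize: hello world"]
targets = ["A new encoder-decoder model is out! #nlp", "hello world"]
inputs = tokenizer(sources, padding="max_length", truncation=True, max_length=64)
labels = tokenizer(targets, padding="max_length", truncation=True, max_length=64)["input_ids"]
train_dataset = [
    {"input_ids": i, "attention_mask": a, "labels": l}
    for i, a, l in zip(inputs["input_ids"], inputs["attention_mask"], labels)
]  # in a real run, padded label positions would be set to -100 so the loss ignores them

args = Seq2SeqTrainingArguments(
    output_dir="xlmr-shared-encdec",  # hypothetical output path
    per_device_train_batch_size=1,
    learning_rate=1e-5,
    max_steps=10,
)
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```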

Expected behavior

The XLM-R model is supported as an EncoderDecoderModel, so it should train just like the smaller mBERT or monolingual RoBERTa models.

This "IndexError: index out of range in self" is thrown by training it with matching context size of the filtered dataset and model:

python3 test_train_multilingual.py "/media/data/models/xlm-roberta-base" 512 False "adamw_hf" 1 1 20000 0.00001 1 1
load the saved and prefiltered dataset from the csv
1110446
['Heestan waxaa qada Khalid Haref Ahmed \nOO ku Jiray Kooxdii Dur Dur!', "Habeen ma hurdoo\nAday horjoogoo\nDharaar ma hargalo\nAduun baabay helayee\nRuntii ku helayoo\nCaawaan iman iman\nOonkaan u liitay\nIga ba'ay harraadkiisa\n\nHannaan wanaageey\nHadal macaaneey\nWadnahaad haleeshayoo\nWaad hirgalaysaa\nRuntii ku helayoo\nMaantaan iman iman\nOonkaan u liitay\nIga ba'ay harraadkiisa\n\nOlolaha jacaylkeenna\nYididdiiladeeniyo\nuur midoo fiyowbaan\nku abaabulaynaa\nUbixii aan beernaan\nku intifaacsanaynaa\nCaawaan iman iman\nOonkaan u liitay\nIga ba'ay harraadkiisa\n\nAfar gu' iyo dheeraad\nAxdigaynu taagnay\nAyaan dantiyo guur\nu adkaynay gaarnoo\nMarwadayda noqotoo\nUbad daadahaysee\nCaawaan iman iman\nOonkaan u liitay\nIga ba'ay harraadkiisa..."]
The following encoder weights were not tied to the decoder ['roberta/pooler']
accumulated batch size: 1
data size without validation set: 1110346
max train steps: 1110346
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
  0%|                                                           | 0/1110346 [00:00<?, ?it/s]/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/models/encoder_decoder/modeling_encoder_decoder.py:639: FutureWarning: Version v4.12.0 introduces a better way to train encoder-decoder models by computing the loss inside the encoder-decoder framework rather than in the decoder itself. You may observe training discrepancies if fine-tuning a model trained with versions anterior to 4.12.0. The decoder_input_ids are now created based on the labels, no need to pass them yourself anymore.
  warnings.warn(DEPRECATION_WARNING, FutureWarning)
{'loss': 12.575, 'learning_rate': 5.0000000000000004e-08, 'epoch': 0.0}
{'eval_loss': 12.56555461883545, 'eval_runtime': 100.4938, 'eval_samples_per_second': 0.995, 'eval_steps_per_second': 0.995, 'learning_rate': 5.0000000000000004e-08, 'epoch': 0.0}
{'loss': 12.5526, 'learning_rate': 1.0000000000000001e-07, 'epoch': 0.0}
{'eval_loss': 12.492094993591309, 'eval_runtime': 99.7907, 'eval_samples_per_second': 1.002, 'eval_steps_per_second': 1.002, 'learning_rate': 1.0000000000000001e-07, 'epoch': 0.0}
{'loss': 12.4601, 'learning_rate': 1.5000000000000002e-07, 'epoch': 0.0}
{'eval_loss': 12.378839492797852, 'eval_runtime': 103.8677, 'eval_samples_per_second': 0.963, 'eval_steps_per_second': 0.963, 'learning_rate': 1.5000000000000002e-07, 'epoch': 0.0}
{'loss': 12.3297, 'learning_rate': 2.0000000000000002e-07, 'epoch': 0.0}
{'eval_loss': 12.233966827392578, 'eval_runtime': 105.7982, 'eval_samples_per_second': 0.945, 'eval_steps_per_second': 0.945, 'learning_rate': 2.0000000000000002e-07, 'epoch': 0.0}
  0%|                                            | 497/1110346 [41:38<1299:57:07,  4.22s/it]Traceback (most recent call last):
  File "test_train_multilingual.py", line 69, in <module>
    trainer.train()
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/trainer.py", line 2725, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/trainer.py", line 2748, in compute_loss
    outputs = model(**inputs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/models/encoder_decoder/modeling_encoder_decoder.py", line 622, in forward
    decoder_outputs = self.decoder(
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 957, in forward
    outputs = self.roberta(
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 830, in forward
    embedding_output = self.embeddings(
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 131, in forward
    position_embeddings = self.position_embeddings(position_ids)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
  0%|          | 497/1110346 [41:40<1550:51:23,  5.03s/it]

In Google Colab, this CUDA error is thrown instead: `CUBLAS_STATUS_EXECUTION_FAILED` when calling `cublasLtMatmul`.

Training completes if the filtered dataset is reduced to e.g. 510 tokens (see the filtering sketch after the traceback below). Running inference with such a trained model then throws this RuntimeError:

Write a tweet for a new transformer model with hashtags #instruction #nlp #generation #encoder #decoder
Traceback (most recent call last):
  File "test_train_multilingual.py", line 110, in <module>
    test_run(1)
  File "test_train_multilingual.py", line 104, in test_run
    output_ids = multilingualInstructionBERT.sharedModel.generate(input_ids.cuda(), do_sample=True, max_new_tokens=int(maximal_length))
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/generation/utils.py", line 1719, in generate
    return self.sample(
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/generation/utils.py", line 2801, in sample
    outputs = self(
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/accelerate/utils/operations.py", line 680, in forward
    return model_forward(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/accelerate/utils/operations.py", line 668, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/models/encoder_decoder/modeling_encoder_decoder.py", line 622, in forward
    decoder_outputs = self.decoder(
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 957, in forward
    outputs = self.roberta(
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 837, in forward
    encoder_outputs = self.encoder(
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 525, in forward
    layer_outputs = layer_module(
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 440, in forward
    cross_attention_outputs = self.crossattention(
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 341, in forward
    self_outputs = self.self(
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hilsenbek/workspace/monolingualInstructionBERT/lib/python3.8/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 277, in forward
    context_layer = torch.matmul(attention_probs, value_layer)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
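
(For reference, a minimal sketch of the length filtering mentioned above; the CSV path, the `text` column name, and the 510-token threshold are assumptions, not the reporter's actual preprocessing.)

```python
# Sketch: drop rows whose tokenized length exceeds 510 tokens, leaving room
# for the special tokens XLM-R adds around the sequence.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
dataset = load_dataset("csv", data_files="instructions.csv")["train"]  # hypothetical CSV

MAX_TOKENS = 510

def short_enough(example):
    # Count tokens without special tokens, keep only short-enough rows.
    ids = tokenizer(example["text"], add_special_tokens=False)["input_ids"]
    return len(ids) <= MAX_TOKENS

filtered = dataset.filter(short_enough)
print(f"kept {len(filtered)} of {len(dataset)} rows")
```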
ArthurZucker commented 3 months ago

Hey, `IndexError: index out of range in self` usually means the tokenizer's maximum encoded token id does not match the size of the embedding matrix. The CUDA error might be asynchronous. Print the input ids.
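
(A minimal sketch of the kind of check suggested here, assuming access to a collated `batch` dict and the `model` from the training script; the helper name is hypothetical.)

```python
import torch

def check_ids(batch: dict, model) -> None:
    """Print the id ranges that reach the model next to the embedding sizes."""
    input_ids: torch.Tensor = batch["input_ids"]
    labels: torch.Tensor = batch["labels"]
    enc_cfg = model.config.encoder  # EncoderDecoderConfig keeps the sub-configs

    print("max input id :", input_ids.max().item())
    print("max label id :", labels[labels != -100].max().item())
    print("vocab size   :", enc_cfg.vocab_size)
    # RoBERTa-style position embeddings use an offset of 2, so sequences longer
    # than max_position_embeddings - 2 overflow the position table.
    print("seq length   :", input_ids.shape[1])
    print("max positions:", enc_cfg.max_position_embeddings)
```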

Bachstelze commented 2 months ago

Yes, the encoded tokens don't match the size of the embedding matrix. Training completed after reducing the filtered dataset to e.g. 510 tokens. Running inference with such a trained model throws the stated RuntimeError. Which input ids should I print?

ArthurZucker commented 2 months ago

Just whatever gets passed to the model. You might not have resized correctly; `model.resize_token_embeddings()` should work.
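
(A minimal sketch of that suggestion, assuming `model` and `tokenizer` are the objects from the training script; for a composite EncoderDecoderModel the encoder and decoder are resized individually, since resizing the wrapper directly may not be supported.)

```python
# Sketch: make both embedding matrices match the tokenizer's vocabulary size.
new_vocab_size = len(tokenizer)
model.encoder.resize_token_embeddings(new_vocab_size)
model.decoder.resize_token_embeddings(new_vocab_size)

# Keep the wrapper config in sync with the resized sub-models.
model.config.encoder.vocab_size = new_vocab_size
model.config.decoder.vocab_size = new_vocab_size
```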

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Bachstelze commented 1 month ago

Why do I have to resize the model if the original size is used? The smaller input size is just for the dataset filtering. Is there documentation somewhere about `model.resize_token_embeddings()`?

ArthurZucker commented 1 month ago

Here is the documentation! https://huggingface.co/docs/transformers/en/main_classes/model#transformers.PreTrainedModel.resize_token_embeddings

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Bachstelze commented 1 week ago

Thanks for the hint @ArthurZucker ! I will try it in the next development iteration.

The stale bot is annoying. Couldn't it provide useful information and suggestions, and only close the issue if those aren't addressed?