Closed prabhat-123 closed 2 years ago
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I'm running into this problem when I run the English-to-Romanian translation example. As far as I know, I haven't modified the script. Training completes the first epoch, then the script throws this error.
2023-11-13 15:47:58.542480: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-11-13 15:47:58.564058: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-13 15:47:58.564080: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-13 15:47:58.564097: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-13 15:47:58.568038: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
11/13/2023 15:47:59 - INFO - __main__ - Training/evaluation parameters TFTrainingArguments(
_n_gpu=-1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gcp_project=None,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=
loading file spiece.model from cache at /.cache/huggingface/hub/models--t5-small/snapshots/df1b051c49625cf57a3d0d8d3863ed4d13564fe4/spiece.model
loading file tokenizer.json from cache at /.cache/huggingface/hub/models--t5-small/snapshots/df1b051c49625cf57a3d0d8d3863ed4d13564fe4/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /.cache/huggingface/hub/models--t5-small/snapshots/df1b051c49625cf57a3d0d8d3863ed4d13564fe4/tokenizer_config.json
Loading cached processed dataset at /.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227/cache-164eb734af318539.arrow
Loading cached processed dataset at /.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227/cache-442e2020e92ebe8e.arrow
Tensorflow: setting up strategy
11/13/2023 15:48:01 - INFO - datasets.arrow_dataset - Loading cached processed dataset at /.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227/cache-164eb734af318539.arrow
11/13/2023 15:48:01 - INFO - datasets.arrow_dataset - Loading cached processed dataset at /.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227/cache-442e2020e92ebe8e.arrow
2023-11-13 15:48:01.416190: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 8825 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3060, pci bus id: 0000:01:00.0, compute capability: 8.6
loading weights file model.safetensors from cache at /.cache/huggingface/hub/models--t5-small/snapshots/df1b051c49625cf57a3d0d8d3863ed4d13564fe4/model.safetensors
Generate config GenerationConfig {
"decoder_start_token_id": 0,
"eos_token_id": 1,
"pad_token_id": 0
}
2023-11-13 15:48:01.656874: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
Loaded 60,506,624 parameters in the TF 2.0 model.
All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.
All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass loss=None if you do not want your model to compute a loss. You can also specify loss='auto' to get the internal loss without printing this info string.
11/13/2023 15:48:04 - INFO - __main__ - Running training
11/13/2023 15:48:04 - INFO - __main__ - Num examples = 610320
11/13/2023 15:48:04 - INFO - __main__ - Num Epochs = 3.0
11/13/2023 15:48:04 - INFO - __main__ - Instantaneous batch size per device = 16
11/13/2023 15:48:04 - INFO - __main__ - Total train batch size = 16
11/13/2023 15:48:04 - INFO - __main__ - Total optimization steps = 114435
Epoch 1/3
2023-11-13 15:48:13.749879: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f01b9364620 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-11-13 15:48:13.749896: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6
2023-11-13 15:48:13.752234: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
2023-11-13 15:48:13.759242: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:442] Loaded cuDNN version 8700
2023-11-13 15:48:13.802724: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
38145/38145 [==============================] - ETA: 0s - loss: 0.6117
Generate config GenerationConfig {
"decoder_start_token_id": 0,
"eos_token_id": 1,
"pad_token_id": 0
}
Traceback (most recent call last):
File "/workspace/transformer/run_translation.py", line 733, in
File "/workspace/transformer/lib/python3.10/site-packages/transformers/keras_callbacks.py", line 202, in generation_function *
return self.model.generate(inputs, attention_mask=attention_mask, **self.generate_kwargs)
File "/workspace/transformer/lib/python3.10/site-packages/transformers/generation/tf_utils.py", line 884, in generate *
is_beam_gen_mode = (
TypeError: '>' not supported between instances of 'NoneType' and 'int'
Process finished with exit code 1
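The traceback ends at a `>` comparison inside `generate`, which suggests that a value expected to be an integer arrived as `None`. The log does not say which value; `num_beams` is one plausible candidate and is used here only as an assumption. A minimal sketch in plain Python that reproduces the same message and shows one possible workaround, with `is_beam_mode` and `drop_none_kwargs` as hypothetical names that are not part of transformers:

```python
# Plain-Python illustration; no TensorFlow or transformers import is needed.


def is_beam_mode(num_beams, do_sample=False):
    """Hypothetical stand-in for the failing check: a '>' comparison on num_beams."""
    return (num_beams > 1) and do_sample is False


def drop_none_kwargs(kwargs):
    """Hypothetical helper: remove entries whose value is None so the
    callee can fall back to its own defaults."""
    return {key: value for key, value in kwargs.items() if value is not None}


# Assumed shape of the kwargs handed to generate(); the values are illustrative.
generate_kwargs = {"max_length": 128, "num_beams": None}

error_message = None
try:
    is_beam_mode(generate_kwargs["num_beams"])
except TypeError as err:
    # Comparing None with an int raises the same TypeError seen in the traceback.
    error_message = str(err)
    print(error_message)

# Dropping the None entry lets the check run with a default beam count of 1.
cleaned = drop_none_kwargs(generate_kwargs)
print(cleaned)
print(is_beam_mode(cleaned.get("num_beams", 1)))
```

If the unset value does turn out to be the beam count, setting it explicitly when launching the script may have the same effect as filtering out the `None` entry.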
@ChristophKnapp Thanks for opening a new issue. Linking here for reference #27505
Environment info
transformers version:
Who can help
Information
Model I am using (Bert, XLNet ...):
The problem arises when using:
The tasks I am working on is:
To reproduce
Steps to reproduce the behavior:
1. 2. 3.
Expected behavior