huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

TypeError: '>' not supported between instances of 'NoneType' and 'int' #27505

Closed. ChristophKnapp closed this issue 9 months ago.

ChristophKnapp commented 9 months ago

System Info

2023-11-15 07:10:34.350171: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-11-15 07:10:34.489992: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-15 07:10:34.490040: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-15 07:10:34.490749: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-15 07:10:34.544177: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-15 07:10:35.040462: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/transformers/commands/env.py:100: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating: Use tf.config.list_physical_devices('GPU') instead.
2023-11-15 07:10:36.233180: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-11-15 07:10:36.235004: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2211] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...


Who can help?

No response

Information

Tasks

Reproduction

  1. Set up a Python environment in PyCharm.
  2. Add the transformers example script for translation from English to Romanian.
  3. Install the required Python libraries from within PyCharm.
  4. Install the transformers development version, as the script requests.
  5. Run the script; after the first epoch the error is thrown.

Expected behavior

I'm running into this problem when I run the English-to-Romanian translation example. I'm not aware of having modified anything in the script. It fits the model up to the end of the first epoch and then throws the error below. There are already two issue reports about this problem that nobody felt responsible for taking on. I pasted this as a comment in one of them, but since I was not sure whether the old issue would be reopened, I decided to create a new one.

I will debug this on my own but any help is appreciated.

Regards

2023-11-13 15:47:58.542480: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-11-13 15:47:58.564058: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-13 15:47:58.564080: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-13 15:47:58.564097: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-13 15:47:58.568038: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
11/13/2023 15:47:59 - INFO - __main__ - Training/evaluation parameters TFTrainingArguments(
_n_gpu=-1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gcp_project=None,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/workspace/transformer/results/runs/Nov13_15-47-59_workstation-bluechip-BUSINESSline-individu,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=/workspace/transformer/results,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=16,
per_device_train_batch_size=16,
poly_power=1.0,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=/workspace/transformer/results,
save_on_each_node=False,
save_safetensors=True,
save_steps=500,
save_strategy=steps,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_name=None,
tpu_num_cores=None,
tpu_zone=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xla=False,
)
Loading Dataset Infos from /.cache/huggingface/modules/datasets_modules/datasets/wmt16/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227
Overwrite dataset info from restored data version if exists.
Loading Dataset info from /.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227
11/13/2023 15:48:01 - INFO - datasets.info - Loading Dataset Infos from /.cache/huggingface/modules/datasets_modules/datasets/wmt16/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227
11/13/2023 15:48:01 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists.
11/13/2023 15:48:01 - INFO - datasets.info - Loading Dataset info from /.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227
11/13/2023 15:48:01 - INFO - datasets.builder - Found cached dataset wmt16 (/.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227)
11/13/2023 15:48:01 - INFO - datasets.info - Loading Dataset info from /.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227
Found cached dataset wmt16 (/.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227)
Loading Dataset info from /.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227
loading configuration file config.json from cache at /.cache/huggingface/hub/models--t5-small/snapshots/df1b051c49625cf57a3d0d8d3863ed4d13564fe4/config.json
Model config T5Config {
  "_name_or_path": "t5-small",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "classifier_dropout": 0.0,
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to German: "
    },
    "translation_en_to_fr": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to French: "
    },
    "translation_en_to_ro": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to Romanian: "
    }
  },
  "transformers_version": "4.36.0.dev0",
  "use_cache": true,
  "vocab_size": 32128
}

loading file spiece.model from cache at /.cache/huggingface/hub/models--t5-small/snapshots/df1b051c49625cf57a3d0d8d3863ed4d13564fe4/spiece.model
loading file tokenizer.json from cache at /.cache/huggingface/hub/models--t5-small/snapshots/df1b051c49625cf57a3d0d8d3863ed4d13564fe4/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /.cache/huggingface/hub/models--t5-small/snapshots/df1b051c49625cf57a3d0d8d3863ed4d13564fe4/tokenizer_config.json
Loading cached processed dataset at /.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227/cache-164eb734af318539.arrow
Loading cached processed dataset at /.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227/cache-442e2020e92ebe8e.arrow
Tensorflow: setting up strategy
11/13/2023 15:48:01 - INFO - datasets.arrow_dataset - Loading cached processed dataset at /.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227/cache-164eb734af318539.arrow
11/13/2023 15:48:01 - INFO - datasets.arrow_dataset - Loading cached processed dataset at /.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227/cache-442e2020e92ebe8e.arrow
2023-11-13 15:48:01.416190: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 8825 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3060, pci bus id: 0000:01:00.0, compute capability: 8.6
loading weights file model.safetensors from cache at /.cache/huggingface/hub/models--t5-small/snapshots/df1b051c49625cf57a3d0d8d3863ed4d13564fe4/model.safetensors
Generate config GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0
}

2023-11-13 15:48:01.656874: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
Loaded 60,506,624 parameters in the TF 2.0 model.
All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model. If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass loss=None if you do not want your model to compute a loss. You can also specify loss='auto' to get the internal loss without printing this info string.
11/13/2023 15:48:04 - INFO - __main__ - ***** Running training *****
11/13/2023 15:48:04 - INFO - __main__ - Num examples = 610320
11/13/2023 15:48:04 - INFO - __main__ - Num Epochs = 3.0
11/13/2023 15:48:04 - INFO - __main__ - Instantaneous batch size per device = 16
11/13/2023 15:48:04 - INFO - __main__ - Total train batch size = 16
11/13/2023 15:48:04 - INFO - __main__ - Total optimization steps = 114435
Epoch 1/3
2023-11-13 15:48:13.749879: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f01b9364620 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-11-13 15:48:13.749896: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6
2023-11-13 15:48:13.752234: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
2023-11-13 15:48:13.759242: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:442] Loaded cuDNN version 8700
2023-11-13 15:48:13.802724: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
38145/38145 [==============================] - ETA: 0s - loss: 0.6117
Generate config GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0
}

Traceback (most recent call last):
  File "/workspace/transformer/run_translation.py", line 733, in <module>
    main()
  File "/workspace/transformer/run_translation.py", line 693, in main
    history = model.fit(tf_train_dataset, epochs=int(training_args.num_train_epochs), callbacks=callbacks)
  File "/workspace/transformer/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/workspace/transformer/lib/python3.10/site-packages/transformers/keras_callbacks.py", line 223, in on_epoch_end
    predictions = self.generation_function(generation_inputs, attention_mask=attention_mask)
  File "/tmp/__autograph_generated_fileg5wrw6ci.py", line 13, in tf__generation_function
    retval_ = ag__.converted_call(ag__.ld(self).model.generate, (ag__.ld(inputs),), dict(attention_mask=ag__.ld(attention_mask), **ag__.ld(self).generate_kwargs), fscope)
  File "/tmp/__autograph_generated_fileqqh0lf7s.py", line 437, in tf__generate
    is_beam_gen_mode = ag__.and_(lambda: ag__.not_(ag__.ld(is_contrastive_search_gen_mode)), lambda: ag__.and_(lambda: ag__.ld(generation_config).num_beams > 1, lambda: ag__.ld(generation_config).do_sample is False))
  File "/tmp/__autograph_generated_fileqqh0lf7s.py", line 437, in <lambda>
    is_beam_gen_mode = ag__.and_(lambda: ag__.not_(ag__.ld(is_contrastive_search_gen_mode)), lambda: ag__.and_(lambda: ag__.ld(generation_config).num_beams > 1, lambda: ag__.ld(generation_config).do_sample is False))
  File "/tmp/__autograph_generated_fileqqh0lf7s.py", line 437, in <lambda>
    is_beam_gen_mode = ag__.and_(lambda: ag__.not_(ag__.ld(is_contrastive_search_gen_mode)), lambda: ag__.and_(lambda: ag__.ld(generation_config).num_beams > 1, lambda: ag__.ld(generation_config).do_sample is False))
TypeError: in user code:

    File "/workspace/transformer/lib/python3.10/site-packages/transformers/keras_callbacks.py", line 202, in generation_function  *
        return self.model.generate(inputs, attention_mask=attention_mask, **self.generate_kwargs)
    File "/workspace/transformer/lib/python3.10/site-packages/transformers/generation/tf_utils.py", line 884, in generate
        is_beam_gen_mode = (

TypeError: '>' not supported between instances of 'NoneType' and 'int'

Process finished with exit code 1

ChristophKnapp commented 9 months ago

--max_train_samples 500 --max_eval_samples 500 --max_predict_samples 500

reduces the waiting time for this error to appear to about a minute. This works for me; reducing the sample counts further does not seem to help much.

amyeroberts commented 9 months ago

cc @Rocketknight1 as it seems to be failing on a TF script

Rocketknight1 commented 9 months ago

Hi @ChristophKnapp, can you confirm that you're running the translation example from here and paste me the exact command you used to run it so I can reproduce?

ChristophKnapp commented 9 months ago

@Rocketknight1 Yes, that's the script I'm using. The command-line options are:

--output_dir /workspace/results
--model_name_or_path t5-small
--do_train
--do_eval
--source_lang en
--target_lang ro
--source_prefix translate_English_toRomanian:
--dataset_name wmt16
--dataset_config_name ro-en
--per_device_train_batch_size=16
--per_device_eval_batch_size=16
--overwrite_output_dir
--max_train_samples 500
--max_eval_samples 500
--max_predict_samples 500

Except for the last three options, that should be exactly as recommended in the example's .md file.

Rocketknight1 commented 9 months ago

Confirmed the issue here - the problem is this code:

is_beam_gen_mode = (
    not is_contrastive_search_gen_mode
    and (generation_config.num_beams > 1)
    and generation_config.do_sample is False
)

The problem is that in this case generation_config has no num_beams value set, so the attribute comes back as None, and evaluating None > 1 raises the TypeError above. I also see a couple of other issues in this example script that should be fixed.

@gante are you okay if I open a PR to replace this with something like getattr(generation_config, "num_beams", 1)?
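For illustration, a minimal sketch of the failure mode and the proposed guard (this is not the actual transformers code; FakeGenerationConfig is a hypothetical stand-in for a config with no beam count set):

# Sketch of the failure: in Python 3, ordering comparisons between None
# and int raise TypeError, which is exactly the error in the traceback.
class FakeGenerationConfig:
    num_beams = None  # stands in for a config where num_beams was never set

generation_config = FakeGenerationConfig()

try:
    is_beam_gen_mode = generation_config.num_beams > 1
except TypeError as err:
    print(err)  # '>' not supported between instances of 'NoneType' and 'int'

# Proposed guard: getattr covers a missing attribute; the trailing "or 1"
# also covers the case where the attribute exists but is set to None.
num_beams = getattr(generation_config, "num_beams", 1) or 1
is_beam_gen_mode = num_beams > 1  # False instead of a TypeError

Either way, the comparison then sees an int, and beam search is simply skipped when no beam count is configured.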

Rocketknight1 commented 9 months ago

@ChristophKnapp thank you for the bug report! We've opened a PR to fix it at #27519. The updated example script is here - please try it and let me know if the issue is resolved!

ChristophKnapp commented 9 months ago

> @ChristophKnapp thank you for the bug report! We've opened a PR to fix it at #27519. The updated example script is here - please try it and let me know if the issue is resolved!

Thanks a lot for all your help. The script finishes with no errors now.

Rocketknight1 commented 9 months ago

No probs! We try to run tests, but sometimes bug reports like these are the only way we find out about issues like this. It was probably affecting lots of people, not just you. Thanks for letting us know!

Vinodvs34 commented 8 months ago

727