keras-team / keras-io

Keras documentation, hosted live at keras.io

Error in neural_machine_translation_with_transformer.py. English-to-Spanish translation with a sequence-to-sequence Transformer #1074

Closed: SanJoseCosta closed this issue 7 months ago

SanJoseCosta commented 2 years ago

Running the tutorial script (saved locally as put/es.py) crashes during training with an incompatible-shapes error. Command and full log below:

python3 put/es.py

2022-09-19 17:20:53.574026: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-19 17:20:53.725590: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-09-19 17:20:53.725630: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-09-19 17:20:53.761226: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-09-19 17:20:54.699879: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2022-09-19 17:20:54.699975: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2022-09-19 17:20:54.699992: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
2638744/2638744 [==============================] - 0s 0us/step
("Tell me when you're done.", '[start] Avisame cuando termines. [end]')
('Reading develops the mind.', '[start] Leer desarrolla la mente. [end]')
('When will his new novel be published?', '[start] ¿Cuándo se publica su nueva novela? [end]')
('Mary has her back to us.', '[start] Mary nos da la espalda. [end]')
('He is our teacher of English.', '[start] Él es nuestro profesor de inglés. [end]')
118964 total pairs
83276 training pairs
17844 validation pairs
17844 test pairs
2022-09-19 17:20:56.153923: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-09-19 17:20:56.153968: W tensorflow/stream_executor/cuda/cuda_driver.cc:263] failed call to cuInit: UNKNOWN ERROR (303)
2022-09-19 17:20:56.153991: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ip-10-0-0-90.ec2.internal): /proc/driver/nvidia/version does not exist
2022-09-19 17:20:56.154559: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
inputs["encoder_inputs"].shape: (64, 20)
inputs["decoder_inputs"].shape: (64, 20)
targets.shape: (64, 20)
2022-09-19 17:21:41.569386: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
Model: "transformer"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 encoder_inputs (InputLayer)    [(None, None)]       0           []                               

 positional_embedding (Position  (None, None, 256)   3845120     ['encoder_inputs[0][0]']         
 alEmbedding)                                                                                     

 decoder_inputs (InputLayer)    [(None, None)]       0           []                               

 transformer_encoder (Transform  (None, None, 256)   3155456     ['positional_embedding[0][0]']   
 erEncoder)                                                                                       

 model_1 (Functional)           (None, None, 15000)  12959640    ['decoder_inputs[0][0]',         
                                                                  'transformer_encoder[0][0]']    

==================================================================================================
Total params: 19,960,216
Trainable params: 19,960,216
Non-trainable params: 0
__________________________________________________________________________________________________
Traceback (most recent call last):
  File "put/es.py", line 408, in <module>
    transformer.fit(train_ds, epochs=epochs, validation_data=val_ds)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 55, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Detected at node 'gradient_tape/transformer/transformer_encoder/multi_head_attention/softmax/add/BroadcastGradientArgs' defined at (most recent call last):
    File "put/es.py", line 408, in <module>
      transformer.fit(train_ds, epochs=epochs, validation_data=val_ds)
    File "/home/ec2-user/.local/lib/python3.7/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/ec2-user/.local/lib/python3.7/site-packages/keras/engine/training.py", line 1564, in fit
      tmp_logs = self.train_function(iterator)
    File "/home/ec2-user/.local/lib/python3.7/site-packages/keras/engine/training.py", line 1160, in train_function
      return step_function(self, iterator)
    File "/home/ec2-user/.local/lib/python3.7/site-packages/keras/engine/training.py", line 1146, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/ec2-user/.local/lib/python3.7/site-packages/keras/engine/training.py", line 1135, in run_step
      outputs = model.train_step(data)
    File "/home/ec2-user/.local/lib/python3.7/site-packages/keras/engine/training.py", line 997, in train_step
      self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    File "/home/ec2-user/.local/lib/python3.7/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 577, in minimize
      loss, var_list=var_list, grad_loss=grad_loss, tape=tape
    File "/home/ec2-user/.local/lib/python3.7/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 635, in _compute_gradients
      tape, loss, var_list, grad_loss
    File "/home/ec2-user/.local/lib/python3.7/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 510, in _get_gradients
      grads = tape.gradient(loss, var_list, grad_loss)
Node: 'gradient_tape/transformer/transformer_encoder/multi_head_attention/softmax/add/BroadcastGradientArgs'
Incompatible shapes: [64,8,20,20] vs. [64,64,20,20]
     [[{{node gradient_tape/transformer/transformer_encoder/multi_head_attention/softmax/add/BroadcastGradientArgs}}]] [Op:__inference_train_function_16963]
2022-09-19 17:21:50.937059: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
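As an aside, the cache warning that bookends the log is unrelated to the crash, but the pipeline pattern it describes is easy to illustrate. A minimal sketch of the two orderings the warning contrasts (the dataset and k below are illustrative, not taken from the tutorial):

import tensorflow as tf

dataset = tf.data.Dataset.range(100)  # illustrative dataset
k = 10

# Ordering the warning flags: cache() wraps the full dataset, but take(k)
# stops reading after k elements, so the partial cache is discarded on
# every repeat and the warning fires each epoch.
flagged = dataset.cache().take(k).repeat()

# Ordering the warning recommends: truncate first, then cache, so each
# repeat reads the cached k elements to completion.
recommended = dataset.take(k).cache().repeat()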
ymcki commented 1 year ago

I also hit a crash with an incompatible-shapes error:

Epoch 1/30
Traceback (most recent call last):
  File "transformer_train.py", line 408, in <module>
    transformer.fit(train_ds, epochs=epochs, validation_data=val_ds)
  File "/opt/miniconda3/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/opt/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Detected at node 'gradient_tape/transformer/transformer_encoder/multi_head_attention/softmax/add/BroadcastGradientArgs' defined at (most recent call last):
    File "transformer_train.py", line 408, in <module>
      transformer.fit(train_ds, epochs=epochs, validation_data=val_ds)
    File "/opt/miniconda3/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/opt/miniconda3/lib/python3.8/site-packages/keras/engine/training.py", line 1564, in fit
      tmp_logs = self.train_function(iterator)
    File "/opt/miniconda3/lib/python3.8/site-packages/keras/engine/training.py", line 1160, in train_function
      return step_function(self, iterator)
    File "/opt/miniconda3/lib/python3.8/site-packages/keras/engine/training.py", line 1146, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/opt/miniconda3/lib/python3.8/site-packages/keras/engine/training.py", line 1135, in run_step
      outputs = model.train_step(data)
    File "/opt/miniconda3/lib/python3.8/site-packages/keras/engine/training.py", line 997, in train_step
      self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    File "/opt/miniconda3/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 576, in minimize
      grads_and_vars = self._compute_gradients(
    File "/opt/miniconda3/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 634, in _compute_gradients
      grads_and_vars = self._get_gradients(
    File "/opt/miniconda3/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 510, in _get_gradients
      grads = tape.gradient(loss, var_list, grad_loss)
Node: 'gradient_tape/transformer/transformer_encoder/multi_head_attention/softmax/add/BroadcastGradientArgs'
Incompatible shapes: [64,8,20,20] vs. [64,64,20,20]
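For context on the failing shapes: [64, 8, 20, 20] matches the tutorial's attention scores, (batch_size=64, num_heads=8, seq_len=20, seq_len=20), while [64, 64, 20, 20] carries the batch size in the heads axis, which suggests the padding mask is being broadcast along the wrong axis under TF 2.10/2.11. A minimal sketch of the documented attention_mask contract for keras.layers.MultiHeadAttention (the shapes and mask below are illustrative, and this is not a confirmed fix for the crash):

import tensorflow as tf
from tensorflow.keras import layers

batch, heads, seq_len, embed_dim = 64, 8, 20, 256
x = tf.random.normal((batch, seq_len, embed_dim))

# MultiHeadAttention documents attention_mask as a boolean tensor
# broadcastable to (batch, target_len, source_len); a per-token padding
# mask of shape (batch, seq_len) gains a query axis, nothing more.
padding = tf.random.uniform((batch, seq_len)) > 0.1  # illustrative mask
attn_mask = padding[:, tf.newaxis, :]                # (64, 1, 20)

mha = layers.MultiHeadAttention(num_heads=heads, key_dim=embed_dim // heads)
out = mha(query=x, value=x, attention_mask=attn_mask)
print(out.shape)  # (64, 20, 256)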

BenjaminMickler commented 1 year ago

I have the same problem as SanJoseCosta. It works in Google Colab (TensorFlow 2.9.2) but not on my own computer running TensorFlow 2.11.0.

2022-12-08 05:53:18.266030: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to dataset.cache().take(k).repeat(). You should use dataset.take(k).cache().repeat() instead.

@fchollet Would you please help?

BenjaminMickler commented 1 year ago

I just installed TensorFlow 2.9.2 and it works. I still get this warning, so it must not be the problem:

2022-12-08 06:51:10.808714: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to dataset.cache().take(k).repeat(). You should use dataset.take(k).cache().repeat() instead.

With 2.11.0 I get this error just before the above warning:

File "/home/ben/miniconda3/envs/tf/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1132, in __call__
  outputs = call_fn(inputs, *args, **kwargs)
File "/home/ben/miniconda3/envs/tf/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
  return fn(*args, **kwargs)
File "/home/ben/miniconda3/envs/tf/lib/python3.10/site-packages/keras/layers/activation/softmax.py", line 95, in call
  inputs += adder

Node: 'transformer/transformer_encoder/multi_head_attention/softmax/add'
2 root error(s) found.
  (0) INVALID_ARGUMENT: required broadcastable shapes
     [[{{node transformer/transformer_encoder/multi_head_attention/softmax/add}}]]
     [[broadcast_weights_1/assert_broadcastable/is_valid_shape/else/_1/broadcast_weights_1/assert_broadcastable/is_valid_shape/has_valid_nonscalar_shape/then/_53/broadcast_weights_1/assert_broadcastable/is_valid_shape/has_valid_nonscalar_shape/has_invalid_dims/concat/_94]]
  (1) INVALID_ARGUMENT: required broadcastable shapes
     [[{{node transformer/transformer_encoder/multi_head_attention/softmax/add}}]]
0 successful operations. 0 derived errors ignored. [Op:__inference_train_function_17025]

BenjaminMickler commented 1 year ago

I don't want to be stuck on an old version of TensorFlow forever, though. Even though I have a temporary workaround, it would be great if someone could help me get this working on current releases.
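Until then, one stopgap is to make the version assumption explicit at startup. A purely illustrative guard, assuming only that 2.9.x is known-good per the reports above:

import tensorflow as tf

# Illustrative guard only: users in this thread report the example trains
# on TF 2.9.x but crashes with the broadcast error on 2.10/2.11.
major, minor = (int(v) for v in tf.__version__.split(".")[:2])
if (major, minor) >= (2, 10):
    print(
        f"Warning: TF {tf.__version__} has been reported to trigger the "
        "incompatible-shapes crash in this example; TF 2.9.2 is known-good, "
        "or use the updated Keras 3 tutorial."
    )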

sachinprasadhs commented 8 months ago

Hi,

The example has been updated to Keras 3 and now runs without issue.

Here is the link to the updated tutorial: https://keras.io/examples/nlp/neural_machine_translation_with_transformer/

github-actions[bot] commented 8 months ago

This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 28 days. Please reopen if you'd like to work on this further.