noahchalifour / rnnt-speech-recognition

End-to-end speech recognition using RNN Transducers in Tensorflow 2.0
MIT License
242 stars, 79 forks

Getting ValueError: Attempt to convert a value (PerReplica ..) when starting training #29

Open stefan-falk opened 4 years ago

stefan-falk commented 4 years ago

Hi!

I am trying to start a simple training run by following the instructions in the README.md. Everything works up to the point where I start the training.

Executing

python run_rnnt.py --mode train --data_dir /home/sfalk/pt/shards/

Throws

ValueError: Attempt to convert a value (PerReplica ..) 
Full error log:

```
/home/sfalk/miniconda3/envs/rnnt/lib/python3.8/site-packages/librosa/util/decorators.py:9: NumbaDeprecationWarning: An import was requested from a module that has moved location.
Import requested from: 'numba.decorators', please update to use 'numba.core.decorators' or pin to Numba version 0.48.0. This alias will not be present in Numba version 0.50.0.
  from numba.decorators import jit as optional_jit
/home/sfalk/miniconda3/envs/rnnt/lib/python3.8/site-packages/librosa/util/decorators.py:9: NumbaDeprecationWarning: An import was requested from a module that has moved location.
Import of 'jit' requested from: 'numba.decorators', please update to use 'numba.core.decorators' or pin to Numba version 0.48.0. This alias will not be present in Numba version 0.50.0.
  from numba.decorators import jit as optional_jit
2020-05-26 09:14:30.736191: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-26 09:14:30.748386: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-26 09:14:30.749173: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
2020-05-26 09:14:30.749232: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-26 09:14:30.750058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
pciBusID: 0000:02:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth:
451.17GiB/s 2020-05-26 09:14:30.750112: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.750888: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 2 with properties: pciBusID: 0000:03:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1 coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s 2020-05-26 09:14:30.750927: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.751427: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 3 with properties: pciBusID: 0000:05:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1 coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s 2020-05-26 09:14:30.751570: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1 2020-05-26 09:14:30.752638: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10 2020-05-26 09:14:30.753673: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10 2020-05-26 09:14:30.753866: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10 2020-05-26 09:14:30.754997: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10 2020-05-26 09:14:30.755618: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10 2020-05-26 09:14:30.757804: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully 
opened dynamic library libcudnn.so.7 2020-05-26 09:14:30.757899: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.759345: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.760097: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.760844: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.761589: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.762328: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.763068: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.763805: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.764521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1, 2, 3 2020-05-26 09:14:30.764770: I 
tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2020-05-26 09:14:30.770070: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 4200000000 Hz 2020-05-26 09:14:30.770492: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560e6150fef0 initialized for platform Host (this does not guarantee that XLA will be used). Devices: 2020-05-26 09:14:30.770507: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version 2020-05-26 09:14:31.020514: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.037961: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.041811: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.049635: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.050189: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560e60e72d20 initialized for platform CUDA (this does not guarantee that XLA will be used). 
Devices: 2020-05-26 09:14:31.050199: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1 2020-05-26 09:14:31.050203: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1 2020-05-26 09:14:31.050206: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): GeForce GTX 1080 Ti, Compute Capability 6.1 2020-05-26 09:14:31.050209: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (3): GeForce GTX 1080 Ti, Compute Capability 6.1 2020-05-26 09:14:31.051527: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.051949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: pciBusID: 0000:01:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1 coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s 2020-05-26 09:14:31.051989: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.052409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties: pciBusID: 0000:02:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1 coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s 2020-05-26 09:14:31.052448: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.052867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 2 with properties: pciBusID: 0000:03:00.0 name: GeForce GTX 1080 
Ti computeCapability: 6.1 coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s 2020-05-26 09:14:31.052904: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.053326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 3 with properties: pciBusID: 0000:05:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1 coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s 2020-05-26 09:14:31.053353: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1 2020-05-26 09:14:31.053366: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10 2020-05-26 09:14:31.053377: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10 2020-05-26 09:14:31.053387: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10 2020-05-26 09:14:31.053397: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10 2020-05-26 09:14:31.053407: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10 2020-05-26 09:14:31.053418: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2020-05-26 09:14:31.053452: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.053895: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative 
value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.054339: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.054782: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.055227: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.055669: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.056126: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.056579: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.057003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1, 2, 3 2020-05-26 09:14:31.057025: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1 2020-05-26 09:14:31.059325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-05-26 09:14:31.059335: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0 1 2 3 2020-05-26 09:14:31.059340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 
0: N Y Y Y 2020-05-26 09:14:31.059344: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 1: Y N Y Y 2020-05-26 09:14:31.059347: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 2: Y Y N Y 2020-05-26 09:14:31.059350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 3: Y Y Y N 2020-05-26 09:14:31.060102: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.060567: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.061033: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.061486: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.061942: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.062368: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9449 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1) 2020-05-26 09:14:31.062688: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.063131: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created 
TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10161 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1) 2020-05-26 09:14:31.063473: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.064690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10161 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1) 2020-05-26 09:14:31.065011: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.065455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10161 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1) 4 Physical GPU, 4 Logical GPUs WARNING:tensorflow:From /home/sfalk/tmp/rnnt-speech-recognition/model.py:59: LSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version. Instructions for updating: This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0. W0526 09:14:32.108052 140106746382080 deprecation.py:317] From /home/sfalk/tmp/rnnt-speech-recognition/model.py:59: LSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version. Instructions for updating: This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0. WARNING:tensorflow:: Note that this cell is not optimized for performance. 
Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. W0526 09:14:32.108385 140106746382080 rnn_cell_impl.py:909] : Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. WARNING:tensorflow:From /home/sfalk/miniconda3/envs/rnnt/lib/python3.8/site-packages/tensorflow/python/ops/rnn_cell_impl.py:962: Layer.add_variable (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version. Instructions for updating: Please use `layer.add_weight` method instead. W0526 09:14:32.109819 140106746382080 deprecation.py:317] From /home/sfalk/miniconda3/envs/rnnt/lib/python3.8/site-packages/tensorflow/python/ops/rnn_cell_impl.py:962: Layer.add_variable (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version. Instructions for updating: Please use `layer.add_weight` method instead. WARNING:tensorflow:: Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. W0526 09:14:32.227335 140106746382080 rnn_cell_impl.py:909] : Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. WARNING:tensorflow:: Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. W0526 09:14:32.490125 140106746382080 rnn_cell_impl.py:909] : Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. WARNING:tensorflow:: Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. W0526 09:14:32.669947 140106746382080 rnn_cell_impl.py:909] : Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. 
WARNING:tensorflow:: Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. W0526 09:14:32.804272 140106746382080 rnn_cell_impl.py:909] : Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. WARNING:tensorflow:: Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. W0526 09:14:32.951039 140106746382080 rnn_cell_impl.py:909] : Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. WARNING:tensorflow:: Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. W0526 09:14:33.074690 140106746382080 rnn_cell_impl.py:909] : Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. WARNING:tensorflow:: Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. W0526 09:14:33.202479 140106746382080 rnn_cell_impl.py:909] : Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. WARNING:tensorflow:: Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. W0526 09:14:33.890956 140106746382080 rnn_cell_impl.py:909] : Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. WARNING:tensorflow:: Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. W0526 09:14:34.015121 140106746382080 rnn_cell_impl.py:909] : Note that this cell is not optimized for performance. 
Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU. I0526 09:14:34.344151 140106746382080 run_rnnt.py:490] Using word-piece encoder with vocab size: 4341 Model: "encoder" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input_1 (InputLayer) [(None, None, 240)] 0 _________________________________________________________________ batch_normalization (BatchNo (None, None, 240) 960 _________________________________________________________________ rnn (RNN) (None, None, 640) 8527872 _________________________________________________________________ dropout (Dropout) (None, None, 640) 0 _________________________________________________________________ layer_normalization (LayerNo (None, None, 640) 1280 _________________________________________________________________ rnn_1 (RNN) (None, None, 640) 11804672 _________________________________________________________________ dropout_1 (Dropout) (None, None, 640) 0 _________________________________________________________________ layer_normalization_1 (Layer (None, None, 640) 1280 _________________________________________________________________ time_reduction (TimeReductio (None, None, 1280) 0 _________________________________________________________________ rnn_2 (RNN) (None, None, 640) 17047552 _________________________________________________________________ dropout_2 (Dropout) (None, None, 640) 0 _________________________________________________________________ layer_normalization_2 (Layer (None, None, 640) 1280 _________________________________________________________________ rnn_3 (RNN) (None, None, 640) 11804672 _________________________________________________________________ dropout_3 (Dropout) (None, None, 640) 0 _________________________________________________________________ layer_normalization_3 (Layer (None, None, 640) 1280 
_________________________________________________________________ rnn_4 (RNN) (None, None, 640) 11804672 _________________________________________________________________ dropout_4 (Dropout) (None, None, 640) 0 _________________________________________________________________ layer_normalization_4 (Layer (None, None, 640) 1280 _________________________________________________________________ rnn_5 (RNN) (None, None, 640) 11804672 _________________________________________________________________ dropout_5 (Dropout) (None, None, 640) 0 _________________________________________________________________ layer_normalization_5 (Layer (None, None, 640) 1280 _________________________________________________________________ rnn_6 (RNN) (None, None, 640) 11804672 _________________________________________________________________ dropout_6 (Dropout) (None, None, 640) 0 _________________________________________________________________ layer_normalization_6 (Layer (None, None, 640) 1280 _________________________________________________________________ rnn_7 (RNN) (None, None, 640) 11804672 _________________________________________________________________ dropout_7 (Dropout) (None, None, 640) 0 _________________________________________________________________ layer_normalization_7 (Layer (None, None, 640) 1280 ================================================================= Total params: 96,414,656 Trainable params: 96,414,176 Non-trainable params: 480 _________________________________________________________________ Model: "prediction_network" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input_2 (InputLayer) [(None, None)] 0 _________________________________________________________________ embedding (Embedding) (None, None, 500) 2170500 _________________________________________________________________ rnn_8 (RNN) (None, None, 640) 10657792 
_________________________________________________________________ dropout_8 (Dropout) (None, None, 640) 0 _________________________________________________________________ layer_normalization_8 (Layer (None, None, 640) 1280 _________________________________________________________________ rnn_9 (RNN) (None, None, 640) 11804672 _________________________________________________________________ dropout_9 (Dropout) (None, None, 640) 0 _________________________________________________________________ layer_normalization_9 (Layer (None, None, 640) 1280 ================================================================= Total params: 24,635,524 Trainable params: 24,635,524 Non-trainable params: 0 _________________________________________________________________ Model: "transducer" __________________________________________________________________________________________________ Layer (type) Output Shape Param # Connected to ================================================================================================== mel_specs (InputLayer) [(None, None, 240)] 0 __________________________________________________________________________________________________ pred_inp (InputLayer) [(None, None)] 0 __________________________________________________________________________________________________ encoder (Model) (None, None, 640) 96414656 mel_specs[0][0] __________________________________________________________________________________________________ prediction_network (Model) (None, None, 640) 24635524 pred_inp[0][0] __________________________________________________________________________________________________ tf_op_layer_ExpandDims (TensorF [(None, None, 1, 640 0 encoder[1][0] __________________________________________________________________________________________________ tf_op_layer_ExpandDims_1 (Tenso [(None, 1, None, 640 0 prediction_network[1][0] __________________________________________________________________________________________________ 
tf_op_layer_AddV2 (TensorFlowOp [(None, None, None, 0 tf_op_layer_ExpandDims[0][0] tf_op_layer_ExpandDims_1[0][0]
__________________________________________________________________________________________________
dense (Dense) (None, None, None, 6 410240 tf_op_layer_AddV2[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, None, None, 4 2782581 dense[0][0]
==================================================================================================
Total params: 124,243,001
Trainable params: 124,242,521
Non-trainable params: 480
__________________________________________________________________________________________________
Starting training.
Performing evaluation.
Traceback (most recent call last):
  File "/home/sfalk/miniconda3/envs/rnnt/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2292, in _convert_inputs_to_signature
    flatten_inputs[index] = ops.convert_to_tensor(
  File "/home/sfalk/miniconda3/envs/rnnt/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1341, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/sfalk/miniconda3/envs/rnnt/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 321, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/home/sfalk/miniconda3/envs/rnnt/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 261, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "/home/sfalk/miniconda3/envs/rnnt/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 270, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/home/sfalk/miniconda3/envs/rnnt/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 96, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
ValueError: Attempt to convert a value (PerReplica:{ 0: , 1: , 2: , 3: }) with an unsupported type () to a Tensor.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_rnnt.py", line 586, in <module>
    app.run(main)
  File "/home/sfalk/miniconda3/envs/rnnt/lib/python3.8/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/sfalk/miniconda3/envs/rnnt/lib/python3.8/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "run_rnnt.py", line 532, in main
    run_training(
  File "run_rnnt.py", line 347, in run_training
    checkpoint_model()
  File "run_rnnt.py", line 304, in checkpoint_model
    eval_loss, eval_metrics_results = run_evaluate(
  File "run_rnnt.py", line 433, in run_evaluate
    loss, metrics_results = eval_step(inputs)
  File "/home/sfalk/miniconda3/envs/rnnt/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "/home/sfalk/miniconda3/envs/rnnt/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 647, in _call
    self._stateful_fn._function_spec.canonicalize_function_inputs( # pylint: disable=protected-access
  File "/home/sfalk/miniconda3/envs/rnnt/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2235, in canonicalize_function_inputs
    inputs = _convert_inputs_to_signature(
  File "/home/sfalk/miniconda3/envs/rnnt/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2296, in _convert_inputs_to_signature
    raise ValueError("When input_signature is provided, all inputs to "
ValueError: When input_signature is provided, all inputs to the Python function must be convertible to tensors:
  inputs: (
    (PerReplica:{ 0: , 1: , 2: , 3: }, PerReplica:{ 0: , 1: , 2: , 3: }, PerReplica:{ 0: , 1: , 2: , 3: }, PerReplica:{ 0: , 1: , 2: , 3: }, PerReplica:{ 0: , 1: , 2: , 3: }))
  input_signature: (
    [TensorSpec(shape=(None, None, 240), dtype=tf.float32, name=None), TensorSpec(shape=(None, None), dtype=tf.int32, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None), TensorSpec(shape=(None, None), dtype=tf.int32, name=None)])
```

Any idea what could be the issue here?

I have the following setup:

$ pip freeze | grep tensor
tensorboard==2.2.1
tensorboard-plugin-wit==1.6.0.post3
tensorflow-datasets==3.1.0
tensorflow-estimator==2.2.0
tensorflow-gpu==2.2.0
tensorflow-metadata==0.22.0
warprnnt-tensorflow==0.1

I noticed that FLAGS.gpus is None, which leads to MirroredStrategy(devices=gpu_names) being called with gpu_names set to None. I am not sure whether this is related to the issue.
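To make the FLAGS.gpus observation concrete, here is a minimal sketch of how an unset flag could end up as devices=None. The helper `resolve_gpu_names` and the comma-separated flag format are assumptions for illustration only; this is not the actual code in run_rnnt.py. With devices=None, tf.distribute.MirroredStrategy mirrors across every GPU visible to TensorFlow, which would match the four GPUs in the log.

```python
from typing import List, Optional


def resolve_gpu_names(gpus_flag: Optional[str]) -> Optional[List[str]]:
    """Hypothetical helper: map a --gpus flag value to device names.

    Assumes the flag is a comma-separated string of GPU indices such as "0,2".
    Returns None when the flag is unset or empty, which is the value that
    would then be passed on as MirroredStrategy(devices=None).
    """
    if not gpus_flag:
        return None
    indices = [part.strip() for part in gpus_flag.split(",") if part.strip()]
    if not indices:
        return None
    return ["/gpu:{}".format(index) for index in indices]


print(resolve_gpu_names(None))    # None: the strategy would use all visible GPUs
print(resolve_gpu_names("0"))     # ['/gpu:0']
print(resolve_gpu_names("0, 2"))  # ['/gpu:0', '/gpu:2']
```

Under that assumption, passing an explicit single-entry list such as ['/gpu:0'] would keep the strategy on one device.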

prajwaljpj commented 4 years ago

@stefan-falk Were you able to resolve this issue?

stefan-falk commented 4 years ago

Using only one GPU resolves this issue (see https://github.com/noahchalifour/rnnt-speech-recognition/issues/18#issuecomment-633862788). But I wouldn't call it "resolved" yet.
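One general way to restrict a run to a single GPU is to hide the other devices through the CUDA_VISIBLE_DEVICES environment variable before TensorFlow is imported. This is only a sketch of that mechanism; the linked comment may do it differently, and the helper name `restrict_visible_gpus` is made up for illustration.

```python
import os


def restrict_visible_gpus(indices):
    """Hypothetical helper: expose only the given GPU indices to CUDA.

    Must be called before TensorFlow is imported, because the CUDA runtime
    reads CUDA_VISIBLE_DEVICES when it initializes.
    """
    value = ",".join(str(index) for index in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


restrict_visible_gpus([0])
print(os.environ["CUDA_VISIBLE_DEVICES"])  # 0
```

The same effect can be had from the shell by prefixing the training command with `CUDA_VISIBLE_DEVICES=0`.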

BuaaAlban commented 4 years ago

Same error here.

stefan-falk commented 4 years ago

@BuaaAlban @prajwaljpj Were you able to run it on multiple GPUs yet?

BuaaAlban commented 4 years ago

> @BuaaAlban @prajwaljpj Were you able to run it on multiple GPUs yet?

No, and it doesn't converge.

Chen188 commented 3 years ago

I'm able to run it on 8 V100 GPUs in a single instance using TF 2.2.0, following https://github.com/tensorflow/tensorflow/issues/29911.

The problem now is that it doesn't converge: the loss is around 100 after 40 epochs with a batch size of 64. The training data is common_voice_data/cv-corpus-6.1-2020-12-11_zh-CN.