Open stefan-falk opened 4 years ago
@stefan-falk Were you able to resolve this issue?
Using only one GPU resolves this issue (see https://github.com/noahchalifour/rnnt-speech-recognition/issues/18#issuecomment-633862788). But I wouldn't call it "resolved" yet.
same error
@BuaaAlban @prajwaljpj Were you able to run it on multiple GPUs yet?
@BuaaAlban @prajwaljpj Were you able to run it on multiple GPUs yet?
No, and it does't converge
I'm able to run it on 8 v100 GPU in single instance, using TF2.2.0 according to https://github.com/tensorflow/tensorflow/issues/29911 .
The problem now is that it does't converge, the Loss is around 100 after 40 epochs, the batch size is 64, training data is common_voice_data/cv-corpus-6.1-2020-12-11_zh-CN.
Hi!
I am currently trying to start a simple training by following the instructions from the README.md. Everything works up to the point where I want to start the training.
Executing
Throws
Click to expand full error log
``` /home/sfalk/miniconda3/envs/rnnt/lib/python3.8/site-packages/librosa/util/decorators.py:9: NumbaDeprecationWarning: An import was requested from a module that has moved location. Import requested from: 'numba.decorators', please update to use 'numba.core.decorators' or pin to Numba version 0.48.0. This alias will not be present in Numba version 0.50.0. from numba.decorators import jit as optional_jit /home/sfalk/miniconda3/envs/rnnt/lib/python3.8/site-packages/librosa/util/decorators.py:9: NumbaDeprecationWarning: An import was requested from a module that has moved location. Import of 'jit' requested from: 'numba.decorators', please update to use 'numba.core.decorators' or pin to Numba version 0.48.0. This alias will not be present in Numba version 0.50.0. from numba.decorators import jit as optional_jit 2020-05-26 09:14:30.736191: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1 2020-05-26 09:14:30.748386: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.749173: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: pciBusID: 0000:01:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1 coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s 2020-05-26 09:14:30.749232: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.750058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties: pciBusID: 0000:02:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1 coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s 2020-05-26 09:14:30.750112: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.750888: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 2 with properties: pciBusID: 0000:03:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1 coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s 2020-05-26 09:14:30.750927: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.751427: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 3 with properties: pciBusID: 0000:05:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1 coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s 2020-05-26 09:14:30.751570: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1 2020-05-26 09:14:30.752638: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10 2020-05-26 09:14:30.753673: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10 2020-05-26 09:14:30.753866: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10 2020-05-26 09:14:30.754997: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10 2020-05-26 09:14:30.755618: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10 2020-05-26 09:14:30.757804: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2020-05-26 09:14:30.757899: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.759345: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.760097: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.760844: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.761589: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.762328: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.763068: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.763805: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:30.764521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1, 2, 3 2020-05-26 09:14:30.764770: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2020-05-26 09:14:30.770070: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 4200000000 Hz 2020-05-26 09:14:30.770492: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560e6150fef0 initialized for platform Host (this does not guarantee that XLA will be used). Devices: 2020-05-26 09:14:30.770507: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version 2020-05-26 09:14:31.020514: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.037961: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.041811: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.049635: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.050189: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560e60e72d20 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices: 2020-05-26 09:14:31.050199: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1 2020-05-26 09:14:31.050203: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1 2020-05-26 09:14:31.050206: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): GeForce GTX 1080 Ti, Compute Capability 6.1 2020-05-26 09:14:31.050209: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (3): GeForce GTX 1080 Ti, Compute Capability 6.1 2020-05-26 09:14:31.051527: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.051949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: pciBusID: 0000:01:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1 coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s 2020-05-26 09:14:31.051989: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.052409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties: pciBusID: 0000:02:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1 coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s 2020-05-26 09:14:31.052448: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.052867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 2 with properties: pciBusID: 0000:03:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1 coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s 2020-05-26 09:14:31.052904: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.053326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 3 with properties: pciBusID: 0000:05:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1 coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s 2020-05-26 09:14:31.053353: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1 2020-05-26 09:14:31.053366: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10 2020-05-26 09:14:31.053377: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10 2020-05-26 09:14:31.053387: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10 2020-05-26 09:14:31.053397: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10 2020-05-26 09:14:31.053407: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10 2020-05-26 09:14:31.053418: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2020-05-26 09:14:31.053452: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.053895: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.054339: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.054782: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.055227: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.055669: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.056126: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.056579: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.057003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1, 2, 3 2020-05-26 09:14:31.057025: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1 2020-05-26 09:14:31.059325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-05-26 09:14:31.059335: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0 1 2 3 2020-05-26 09:14:31.059340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N Y Y Y 2020-05-26 09:14:31.059344: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 1: Y N Y Y 2020-05-26 09:14:31.059347: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 2: Y Y N Y 2020-05-26 09:14:31.059350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 3: Y Y Y N 2020-05-26 09:14:31.060102: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.060567: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.061033: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.061486: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.061942: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.062368: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9449 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1) 2020-05-26 09:14:31.062688: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.063131: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10161 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1) 2020-05-26 09:14:31.063473: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.064690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10161 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1) 2020-05-26 09:14:31.065011: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-05-26 09:14:31.065455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10161 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1) 4 Physical GPU, 4 Logical GPUs WARNING:tensorflow:From /home/sfalk/tmp/rnnt-speech-recognition/model.py:59: LSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version. Instructions for updating: This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0. W0526 09:14:32.108052 140106746382080 deprecation.py:317] From /home/sfalk/tmp/rnnt-speech-recognition/model.py:59: LSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version. Instructions for updating: This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0. WARNING:tensorflow:Any idea what could be the issue here?
I have the following setup:
I noticed that
FLAGS.gpus
isNone
which leads toMirroredStrategy(devices=gpu_names)
wheregpu_names
isNone
. I am not sure if this has something to do with the issue.