iamjanvijay / rnnt

An implementation of RNN-Transducer loss in TF-2.0.
MIT License

Dimension error #3

Closed jtdutta1 closed 4 years ago

jtdutta1 commented 4 years ago
2020-07-10 09:52:09.715599: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-10 09:52:13.875948: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-07-10 09:52:13.910317: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:08:00.0 name: GeForce RTX 2080 computeCapability: 7.5
coreClock: 1.71GHz coreCount: 46 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.23GiB/s
2020-07-10 09:52:13.910441: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-10 09:52:13.966109: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-10 09:52:14.009304: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-07-10 09:52:14.028718: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-07-10 09:52:14.068297: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-07-10 09:52:14.094965: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-07-10 09:52:14.171005: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-10 09:52:14.171229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-07-10 09:52:14.173103: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-10 09:52:14.199127: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x18498c47600 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-10 09:52:14.199354: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-07-10 09:52:14.200187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:08:00.0 name: GeForce RTX 2080 computeCapability: 7.5
coreClock: 1.71GHz coreCount: 46 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.23GiB/s
2020-07-10 09:52:14.200339: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-10 09:52:14.200510: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-10 09:52:14.200624: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-07-10 09:52:14.200728: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-07-10 09:52:14.200834: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-07-10 09:52:14.200937: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-07-10 09:52:14.201032: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-10 09:52:14.201209: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-07-10 09:52:15.580864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-10 09:52:15.581199: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0
2020-07-10 09:52:15.581289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N
2020-07-10 09:52:15.582336: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6609 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:08:00.0, compute capability: 7.5)
2020-07-10 09:52:15.586161: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x184c2fd8b50 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-07-10 09:52:15.586307: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2080, Compute Capability 7.5
Model: "EncoderModel"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(4, None, 20)]           0
_________________________________________________________________
EncoderBlock (EncoderBlock)  (4, None, 100)            370000
=================================================================
Total params: 370,000
Trainable params: 370,000
Non-trainable params: 0
_________________________________________________________________
None
Model: "PredictorModel"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         [(4, None, 28)]           0
_________________________________________________________________
PredictionBlock (PredictionB (4, None, 100)            212400
=================================================================
Total params: 212,400
Trainable params: 212,400
Non-trainable params: 0
_________________________________________________________________
None
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            [(4, None, 20)]      0
__________________________________________________________________________________________________
input_2 (InputLayer)            [(4, None, 28)]      0
__________________________________________________________________________________________________
EncoderBlock (EncoderBlock)     (4, None, 100)       370000      input_1[0][0]
__________________________________________________________________________________________________
PredictionBlock (PredictionBloc (4, None, 100)       212400      input_2[0][0]
__________________________________________________________________________________________________
tf_op_layer_ExpandDims (TensorF [(4, None, 1, 100)]  0           EncoderBlock[0][0]
__________________________________________________________________________________________________
tf_op_layer_ExpandDims_1 (Tenso [(4, 1, None, 100)]  0           PredictionBlock[0][0]
__________________________________________________________________________________________________
tf_op_layer_AddV2 (TensorFlowOp [(4, None, None, 100 0           tf_op_layer_ExpandDims[0][0]
                                                                 tf_op_layer_ExpandDims_1[0][0]
__________________________________________________________________________________________________
time_distributed (TimeDistribut (None, None, None, 1 10100       tf_op_layer_AddV2[0][0]
__________________________________________________________________________________________________
time_distributed_1 (TimeDistrib (None, None, None, 2 2828        time_distributed[0][0]
==================================================================================================
Total params: 595,328
Trainable params: 595,328
Non-trainable params: 0
__________________________________________________________________________________________________
None
Epoch 1/10
(4, 391, 172, 28)
(4, 172)
(4, 1)
(4, 1)
Traceback (most recent call last):
  File "run_model.py", line 74, in <module>
    train(t_model)
  File "run_model.py", line 67, in train
    loss = train_step(t_model, t_data, optimizer)
  File "C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\tensorflow\python\eager\def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\tensorflow\python\eager\def_function.py", line 627, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\tensorflow\python\eager\def_function.py", line 506, in _initialize
    *args, **kwds))
  File "C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\tensorflow\python\eager\function.py", line 2446, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\tensorflow\python\eager\function.py", line 2777, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\tensorflow\python\eager\function.py", line 2667, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\tensorflow\python\framework\func_graph.py", line 981, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\tensorflow\python\eager\def_function.py", line 441, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\tensorflow\python\framework\func_graph.py", line 968, in wrapper
    raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:

    run_model.py:54 train_step  *
        loss = loss_fn(logits, labels, label_lens, mfcc_lens)
    run_model.py:42 loss_fn  *
        return rnnt_loss(logits, labels, label_length, logit_length)
    C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\rnnt-0.0.5-py3.7.egg\rnnt\rnnt.py:195 compute_rnnt_loss_and_grad  *
        result = compute_rnnt_loss_and_grad_helper(**kwargs)
    C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\rnnt-0.0.5-py3.7.egg\rnnt\rnnt.py:112 compute_rnnt_loss_and_grad_helper  *
        blank_probs, truth_probs = transition_probs(one_hot_labels, log_probs)
    C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\rnnt-0.0.5-py3.7.egg\rnnt\rnnt.py:36 transition_probs  *
        truth_probs = tf.reduce_sum(tf.multiply(log_probs[:, :, :-1, :], one_hot_labels), axis=-1)
    C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\tensorflow\python\util\dispatch.py:180 wrapper  **
        return target(*args, **kwargs)
    C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\tensorflow\python\ops\math_ops.py:381 multiply
        return gen_math_ops.mul(x, y, name)
    C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\tensorflow\python\ops\gen_math_ops.py:6092 mul
        "Mul", x=x, y=y, name=name)
    C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\tensorflow\python\framework\op_def_library.py:744 _apply_op_helper
        attrs=attr_protos, op_def=op_def)
    C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\tensorflow\python\framework\func_graph.py:595 _create_op_internal
        compute_device)
    C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\tensorflow\python\framework\ops.py:3327 _create_op_internal
        op_def=op_def)
    C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\tensorflow\python\framework\ops.py:1817 __init__
        control_input_ops, op_def)
    C:\Users\jtdut\anaconda3\envs\rnnt\lib\site-packages\tensorflow\python\framework\ops.py:1657 _create_c_op
        raise ValueError(str(e))

    ValueError: Dimensions must be equal, but are 171 and 172 for '{{node rnnt_loss/Mul}} = Mul[T=DT_FLOAT](rnnt_loss/strided_slice_1, rnnt_loss/one_hot)' with input shapes: [4,391,171,28], [4,391,172,28].

I got this error.

iamjanvijay commented 4 years ago

Hi Jeet!

Can you let me know the dimensions of logits, labels, label_length and logit_length that you passed to the rnnt_loss call?

jtdutta1 commented 4 years ago

Hi, all the info is actually in the log I provided. Sorry I didn't label it properly, but here it is:

logits: (4, 391, 172, 28)
labels: (4, 172)
label_length: (4, 1)
logit_length: (4, 1)

Thanks for the fast reply.

iamjanvijay commented 4 years ago

I see two issues here.

First, reshape label_length and logit_length to shape (4,), i.e. 1-D tensors with one entry per batch item. Second, if your labels have a maximum sequence length (U) of 172, then logits should have the shape (4, 391, 173, 28). The third dimension of logits should be U+1 because the prediction network operates on [0] + [label_sequence_ids], i.e., with a blank symbol (0) prepended to the actual sequence.
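For anyone hitting the same error, here is a minimal sketch of a shape check that encodes the two points above. It works on plain shape tuples and does not call the library; `check_rnnt_shapes` is a hypothetical helper, not part of rnnt. It also shows where the 171 in the error message comes from: the traceback shows the loss slicing `log_probs[:, :, :-1, :]`, so a third dimension of 172 becomes 171, which no longer matches the one-hot labels of length 172.

```python
def check_rnnt_shapes(logits, labels, label_length, logit_length):
    """Return a list of shape problems for rnnt_loss inputs (empty if none).

    All arguments are shape tuples. Assumed layout, taken from the comment
    above: logits (B, T, U+1, V), labels (B, U), label_length (B,),
    logit_length (B,). This helper is illustrative, not part of the library.
    """
    problems = []
    batch, _, u_plus_1, _ = logits
    _, max_label_len = labels

    # The loss drops the last step along axis 2, so it needs U+1 steps there.
    if u_plus_1 != max_label_len + 1:
        problems.append(
            f"logits axis 2 is {u_plus_1}, expected U+1 = {max_label_len + 1}"
        )

    # Both length tensors should be 1-D with one entry per batch item.
    for name, shape in (("label_length", label_length),
                        ("logit_length", logit_length)):
        if shape != (batch,):
            problems.append(f"{name} has shape {shape}, expected ({batch},)")
    return problems


# Shapes from the log in this issue, then the corrected shapes.
reported = check_rnnt_shapes((4, 391, 172, 28), (4, 172), (4, 1), (4, 1))
fixed = check_rnnt_shapes((4, 391, 173, 28), (4, 172), (4,), (4,))

for line in reported:
    print(line)
print("problems after fix:", len(fixed))
```

With the shapes reported in the log, the check flags the logits axis and both length tensors; with the corrected shapes it reports nothing.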

Also, note that the ids in labels should be in [1, 28] and not [0, 27], since the loss function assumes that index 0 is reserved for the blank symbol. In addition, labels and label_length exclude the prepended blank symbol, so their contents should correspond to length U only.
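A small sketch of that convention in plain Python, using made-up ids. `shift_label_ids` and `prediction_network_input` are hypothetical names for illustration only. It assumes the raw ids are 0-based, and that the last logits axis has a slot for blank plus every shifted id.

```python
BLANK_ID = 0  # index reserved for the blank symbol, per the comment above


def shift_label_ids(zero_based_ids):
    """Map ids from [0, K-1] to [1, K] so that index 0 stays free for blank."""
    return [i + 1 for i in zero_based_ids]


def prediction_network_input(label_ids):
    """Prepend blank: a length-U label sequence gives U+1 predictor steps."""
    return [BLANK_ID] + list(label_ids)


raw = [0, 5, 27]                         # hypothetical 0-based label ids
labels = shift_label_ids(raw)            # what goes into `labels`
pred_in = prediction_network_input(labels)  # what the prediction network sees

print(labels, len(labels))    # label_length counts only these U ids
print(pred_in, len(pred_in))  # one step longer because of the blank
```

The loss receives `labels` and `label_length` without the blank; only the prediction network input carries the extra leading step, which is why logits end up with U+1 along the third dimension.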

I'll update these in the README.

jtdutta1 commented 4 years ago

Thank you! I'll make these changes and let you know.