galv opened this issue 4 years ago
I've noticed that the RNN code in lingvo (in particular for ASR tasks) uses bitmasks to represent the validity of data in a sequence when batching.

To be explicit, if you have `data`, a tensor of shape (max_sequence_length, batch_size, feature_size), then `padding` is a tensor of shape (max_sequence_length, batch_size). The vector at `data[i][j]` is valid only if `padding[i][j] == 0`.
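For concreteness, here is a toy example of the two representations side by side (the values are made up purely for illustration):

```python
import numpy as np

# Toy batch of 3 sequences with true lengths 4, 2, and 3, padded to max length 4.
# Bitmask representation, time-major [max_sequence_length, batch_size];
# 1.0 marks a padded (invalid) timestep.
padding = np.array([[0., 0., 0.],
                    [0., 0., 0.],
                    [0., 1., 0.],
                    [0., 1., 1.]], dtype=np.float32)

# The "lengths" representation of the same batch is just a [batch_size] vector.
lengths = np.array([4, 2, 3], dtype=np.int32)
```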
Meanwhile, in the rest of deep learning land, I've always seen the validity of variable-length sequences in a minibatch described via a "lengths" tensor of shape (batch_size,), which obviously uses a lot less memory.

What is the motivation behind this? If I had to guess, it is for making TPUs happy, but I don't know enough about TPU microarchitecture to say. I would like to build a CTC model on top of lingvo, but the CTC loss function https://www.tensorflow.org/api_docs/python/tf/nn/ctc_loss requires a lengths tensor rather than a bitmask tensor. Do you have any recommendations for converting the bitmask padding representation to a lengths representation?
This has to do with our custom implementation of https://github.com/tensorflow/lingvo/blob/master/lingvo/core/recurrent.py, which processes the tensors one timestep at a time.

Conversion is very straightforward:

```python
# Assuming padding is [max_sequence_length, batch_size], with 1.0 marking padded steps.
max_length = tf.shape(padding)[0]
# padding -> lengths: count the valid (unpadded) timesteps per sequence.
lengths = tf.reduce_sum(1.0 - padding, axis=0)
# lengths -> padding: tf.sequence_mask returns [batch_size, max_length], so transpose
# back to the time-major layout assumed above.
padding = 1.0 - tf.transpose(tf.sequence_mask(lengths, max_length, tf.float32))
```
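One extra wrinkle for the CTC use case: the `lengths` computed above come out as float32, while tf.nn.ctc_loss expects integer lengths, so a cast is needed. Below is a minimal, self-contained sketch of the wiring; the shapes, dummy data, and the `lengths_from_padding` helper are made up purely for illustration and are not lingvo APIs:

```python
import tensorflow as tf

def lengths_from_padding(padding):
  """[max_time, batch] padding mask (1.0 = padded) -> int32 lengths of shape [batch]."""
  return tf.cast(tf.reduce_sum(1.0 - padding, axis=0), tf.int32)

# Dummy data purely for illustration: 2 sequences, max_time=5, 4 output classes.
logits = tf.random.normal([5, 2, 4])                    # time-major, like lingvo tensors
padding = tf.constant([[0., 0.],
                       [0., 0.],
                       [0., 0.],
                       [0., 1.],
                       [1., 1.]], dtype=tf.float32)     # true lengths: 4 and 3
labels = tf.constant([[1, 2, 1],
                      [2, 3, 0]], dtype=tf.int32)       # dense labels, zero-padded
label_lengths = tf.constant([3, 2], dtype=tf.int32)

loss = tf.nn.ctc_loss(
    labels=labels,
    logits=logits,
    label_length=label_lengths,
    logit_length=lengths_from_padding(padding),
    logits_time_major=True,
    blank_index=0)  # a lingvo CTC model may use a different blank index
```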