tensorflow / lingvo

Apache License 2.0

Why does lingvo use a binary mask instead of sequence lengths for representing invalid regions of sequences? #227

Open galv opened 4 years ago

galv commented 4 years ago

I've noticed that the rnn code in lingvo (in particular for ASR tasks) uses bitmasks for representing the validity of data in a sequence when you do batching.

To be explicit: if you have data, a tensor of shape (max_sequence_length, batch_size, feature_size), then padding is a tensor of shape (max_sequence_length, batch_size). The vector at data[i][j] is valid only if padding[i][j] == 0; a value of 1 marks a padded frame.

Meanwhile, in the rest of deep learning land, I've always seen the validity of variable-length sequences in a minibatch described via a "length" tensor of shape (batch_size,). This obviously uses a lot less memory.
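
To make the two representations concrete, here is a minimal NumPy sketch of the same toy batch described both ways. The sizes and variable names are made up for illustration, and it assumes 1.0 marks a padded frame and 0.0 a valid one, which is the convention implied by the `1.0 - padding` expression later in this thread.

```python
import numpy as np

# Hypothetical toy batch: two sequences with 5 and 3 valid frames.
max_sequence_length = 5
lengths = np.array([5, 3], dtype=np.int32)  # shape (batch_size,)

# Time-major padding mask of shape (max_sequence_length, batch_size).
# Frame t of sequence b is padded when t >= lengths[b].
time_index = np.arange(max_sequence_length)[:, None]
padding = (time_index >= lengths[None, :]).astype(np.float32)

# The mask stores max_sequence_length * batch_size values,
# while the length tensor stores only batch_size values.
mask_elements = padding.size
length_elements = lengths.size
```

The mask costs more memory, but it already has one entry per (timestep, sequence) pair, so a per-timestep computation can slice `padding[t]` directly.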

What is the motivation behind this? If I had to guess, it is for making TPUs happy, but I don't know enough about TPU microarchitecture to say. I would like to create a CTC model on top of lingvo, but the CTC loss function (https://www.tensorflow.org/api_docs/python/tf/nn/ctc_loss) requires a length tensor rather than a bitmask tensor. Do you have any recommendations for how to convert the bitmask padding representation to a length representation?

jonathanasdf commented 4 years ago

This has to do with our custom recurrent implementation, https://github.com/tensorflow/lingvo/blob/master/lingvo/core/recurrent.py, which processes the tensors one timestep at a time.
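
As a rough illustration of why a per-timestep mask is convenient for step-by-step processing, here is a minimal NumPy sketch of a masked recurrent loop. This is not Lingvo's actual implementation: `run_masked_rnn`, the tanh cell, and the weight names are hypothetical, and it assumes 1.0 marks a padded frame. At each step the mask slice `padding[t]` decides, per sequence, whether to take the new state or carry the old one forward.

```python
import numpy as np

def run_masked_rnn(data, padding, w_x, w_h):
    """Runs a toy tanh RNN over time-major data, freezing state on padded frames.

    data:    [max_time, batch_size, feature_size]
    padding: [max_time, batch_size], 1.0 = padded, 0.0 = valid
    """
    max_time, batch_size, _ = data.shape
    state = np.zeros((batch_size, w_h.shape[0]))
    for t in range(max_time):
        candidate = np.tanh(data[t] @ w_x + state @ w_h)
        p_t = padding[t][:, None]  # [batch_size, 1], broadcasts over state dims
        # Padded frames keep the previous state; valid frames take the update.
        state = p_t * state + (1.0 - p_t) * candidate
    return state

rng = np.random.default_rng(0)
data = rng.normal(size=(5, 2, 3))
padding = np.array([[0., 0.], [0., 0.], [0., 0.], [0., 1.], [0., 1.]])
w_x = rng.normal(size=(3, 4))
w_h = rng.normal(size=(4, 4))

final = run_masked_rnn(data, padding, w_x, w_h)
```

With a length tensor instead, each step would first have to compute something like `t >= lengths` to recover the same per-step mask.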

Conversion is very straightforward:

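import tensorflow as tf
# Assuming padding is [max_sequence_length, batch_size], with 1.0 marking padded frames.
max_length = tf.shape(padding)[0]
# Padding -> lengths: count the valid (0.0) frames per sequence; shape [batch_size].
lengths = tf.cast(tf.reduce_sum(1.0 - padding, axis=0), tf.int32)
# Lengths -> padding: sequence_mask returns [batch_size, max_length], so transpose back.
padding = 1.0 - tf.transpose(tf.sequence_mask(lengths, max_length, tf.float32))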
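
For the CTC use case in the question, the lengths computed from the padding can be passed to `tf.nn.ctc_loss` as `logit_length`. The following is a minimal sketch under stated assumptions: the toy sizes, random logits, and label values are made up for illustration, class 0 is assumed to be the CTC blank, and padding is assumed to appear only at the end of each sequence.

```python
import tensorflow as tf

# Hypothetical toy sizes, for illustration only.
max_time, batch_size, num_classes = 5, 2, 4

# Lingvo-style padding: [max_time, batch_size], 1.0 = padded frame.
# Sequence 0 has 5 valid frames, sequence 1 has 3.
padding = tf.constant(
    [[0., 0.], [0., 0.], [0., 0.], [0., 1.], [0., 1.]])

# Time-major logits, matching the layout of the padding tensor.
logits = tf.random.normal([max_time, batch_size, num_classes])

# Dense labels (zero-padded) and their true lengths.
labels = tf.constant([[1, 2, 3], [2, 1, 0]], dtype=tf.int32)
label_length = tf.constant([3, 2], dtype=tf.int32)

# Count valid frames per sequence and cast to the integer type CTC expects.
logit_length = tf.cast(tf.reduce_sum(1.0 - padding, axis=0), tf.int32)

loss = tf.nn.ctc_loss(
    labels=labels,
    logits=logits,
    label_length=label_length,
    logit_length=logit_length,
    logits_time_major=True,  # logits are [max_time, batch_size, num_classes]
    blank_index=0)           # assumption: class 0 is the blank symbol
```

Because the logits stay time-major, no transpose is needed before the loss; `loss` holds one value per sequence in the batch.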