mrgloom opened 7 years ago
Some info is described here, but it's still not very clear to me:
So, at the input of the RNN we have something like `[num_batch, max_time_step, num_features]`. We use `dynamic_rnn` to perform the recurrent calculations over the input, outputting a tensor of shape `[num_batch, max_time_step, num_hidden]`. After that, we need to do an affine projection at each timestep with weight sharing, so we have to reshape to `[num_batch*max_time_step, num_hidden]`, multiply by a weight matrix of shape `[num_hidden, num_classes]`, add a bias, undo the reshape, and transpose (so we will have `[max_time_step, num_batch, num_classes]` for the CTC loss input); this result will be the input of the `ctc_loss` function.
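The reshape → projection → transpose pipeline described above can be sketched in NumPy (the dimensions below are toy placeholders, not values from the original code, and `rnn_out` stands in for the `dynamic_rnn` output):

```python
import numpy as np

# Toy dimensions (placeholders, not from the original code)
num_batch, max_time_step, num_hidden, num_classes = 2, 5, 8, 11

# Stand-in for the dynamic_rnn output: [num_batch, max_time_step, num_hidden]
rnn_out = np.random.randn(num_batch, max_time_step, num_hidden)

# Shared affine projection: one weight matrix applied at every timestep
W = np.random.randn(num_hidden, num_classes)
b = np.random.randn(num_classes)

flat = rnn_out.reshape(num_batch * max_time_step, num_hidden)  # collapse batch & time
logits = flat @ W + b                                          # [num_batch*max_time_step, num_classes]

# Undo the reshape, then transpose to time-major for the CTC loss
logits = logits.reshape(num_batch, max_time_step, num_classes)
logits = logits.transpose(1, 0, 2)  # [max_time_step, num_batch, num_classes]
print(logits.shape)  # (5, 2, 11)
```

Because the same `W` and `b` are applied to every row of `flat`, the projection is shared across all timesteps, which is exactly what the reshape trick buys you.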
Hi @mrgloom, you can use either width or height as your "time dimension". Using the width you will perform a row-wise scan; otherwise you will perform a column-wise scan. You can also apply conv layers before the LSTM network, followed by a global average pooling, returning a tensor with shape [batch_size, feature_map_height, feature_map_width].
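The choice of time dimension amounts to a transpose before feeding the LSTM; a minimal sketch with a single-channel image batch (shapes are made up for illustration):

```python
import numpy as np

batch_size, height, width = 2, 32, 100
images = np.random.randn(batch_size, height, width)  # single-channel images

# Width as the time axis: `width` steps, each step a column of `height` features
width_as_time = images.transpose(0, 2, 1)  # [batch_size, width, height]

# Height as the time axis: `height` steps, each step a row of `width` features
height_as_time = images                    # [batch_size, height, width]

print(width_as_time.shape)   # (2, 100, 32)
print(height_as_time.shape)  # (2, 32, 100)
```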
I'm trying to solve OCR tasks based on this code.
So what shape should the input to the LSTM have? Suppose we have images of shape `[batch_size, height, width, channels]`; how should they be reshaped to be used as input? Like `[batch_size, width, height*channels]`, so `width` is like the time dimension?

What if I want to have variable width? As I understand it, the sizes of the sequences in a batch should be the same (the common trick is just to pad with zeros at the end of the sequence?), or `batch_size`
should be 1?

What if I want to have variable width and height? As I understand it, I need to use convolutional + global average pooling / spatial pyramid pooling layers before the input to the LSTM, so the output blob will be `[batch_size, feature_map_height, feature_map_width, feature_map_channels]`; how should that blob be reshaped to be used as input to the LSTM? Like `[batch_size, feature_map_width, feature_map_height*feature_map_channels]`? Can we reshape it to a single row like `[batch_size, feature_map_width*feature_map_height*feature_map_channels]`? It will be like a sequence of pixels and we lose some spatial information; will it work?

Here is the definition of the input, but I'm not sure what it means in your case
`[batch_size, max_stepsize, num_features]`: https://github.com/igormq/ctc_tensorflow_example/blob/master/ctc_tensorflow_example.py#L90

And how does the output of the LSTM depend on the input size and the max sequence length? https://github.com/igormq/ctc_tensorflow_example/blob/master/ctc_tensorflow_example.py#L110
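Regarding the variable-width question above: the common pattern is indeed to zero-pad every sequence in the batch to the longest one and pass the true lengths separately (e.g. as the `sequence_length` argument of `dynamic_rnn` / `ctc_loss`, so the padded steps are ignored). A NumPy sketch with made-up shapes:

```python
import numpy as np

num_features = 4
# Hypothetical variable-length sequences, each of shape [time, num_features]
seqs = [np.random.randn(t, num_features) for t in (7, 3, 5)]

seq_len = np.array([s.shape[0] for s in seqs])        # true lengths: [7 3 5]
max_len = seq_len.max()

# Zero-pad at the end so every sequence has length max_len
batch = np.zeros((len(seqs), max_len, num_features))
for i, s in enumerate(seqs):
    batch[i, :s.shape[0]] = s

print(batch.shape)  # (3, 7, 4)
print(seq_len)      # [7 3 5]
```

Alternatively, with `batch_size = 1` no padding is needed, but training is usually slower.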
BTW: here are some examples using 'standard' approaches in Keras + TensorFlow which I want to complement with RNN examples: https://github.com/mrgloom/Char-sequence-recognition
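For the feature-map question asked above, the two candidate reshapes can be compared directly (hypothetical shapes): the first keeps the width axis as time, with each step seeing a full column of features; the second flattens everything into one long row, discarding the spatial layout.

```python
import numpy as np

b, h, w, c = 2, 8, 25, 16
feature_map = np.random.randn(b, h, w, c)  # [batch, height, width, channels]

# Option 1: width as time; each step is a column of height*channels features
seq = feature_map.transpose(0, 2, 1, 3).reshape(b, w, h * c)
print(seq.shape)   # (2, 25, 128)

# Option 2: flatten to a single row per image (spatial structure is lost)
flat = feature_map.reshape(b, h * w * c)
print(flat.shape)  # (2, 3200)
```

Note the transpose before the reshape in option 1: it makes the width axis contiguous as the time dimension before `height` and `channels` are merged into the feature dimension.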