igormq / ctc_tensorflow_example

CTC + Tensorflow Example for ASR
MIT License
312 stars 183 forks

OCR: clarification about input and output #20

Open mrgloom opened 7 years ago

mrgloom commented 7 years ago

I'm trying to solve OCR tasks based on this code.

So what shape should the input to the LSTM have? Suppose we have images of shape [batch_size, height, width, channels]; how should they be reshaped to be used as input? Like [batch_size, width, height*channels], so that width acts as the time dimension?
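For concreteness, a minimal NumPy sketch of the reshape being asked about (all shapes are made up for illustration): transpose width into the time axis, then flatten height and channels into the feature axis.

```python
import numpy as np

# Hypothetical batch of grayscale images: [batch_size, height, width, channels]
batch = np.zeros((4, 32, 100, 1), dtype=np.float32)

# Move width to the "time" axis, then flatten height*channels into features:
# [B, H, W, C] -> [B, W, H, C] -> [B, W, H*C]
seq = batch.transpose(0, 2, 1, 3).reshape(4, 100, 32 * 1)
print(seq.shape)  # (4, 100, 32)
```

Each time step is then one column of the image, with height*channels features per step.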

What if I want to have variable width? As I understand it, the sequences in a batch should all have the same length (is the common trick just to pad with zeros at the end of each sequence, or should batch_size be 1?).
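The zero-padding trick mentioned above can be sketched as follows (shapes and names are illustrative): pad every sequence to the longest one in the batch, and keep the true lengths to pass alongside the padded tensor (e.g. as the sequence_length argument of dynamic_rnn / ctc_loss).

```python
import numpy as np

# Hypothetical variable-width sequences, each [width_i, num_features]:
seqs = [np.ones((50, 32)), np.ones((80, 32)), np.ones((65, 32))]
seq_lens = np.array([s.shape[0] for s in seqs])  # true lengths, kept for the RNN/CTC

# Zero-pad every sequence to the longest width in the batch:
max_len = seq_lens.max()
padded = np.zeros((len(seqs), max_len, 32), dtype=np.float32)
for i, s in enumerate(seqs):
    padded[i, :s.shape[0], :] = s

print(padded.shape)  # (3, 80, 32)
print(seq_lens)      # [50 80 65]
```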

What if I want to have variable width and height? As I understand it, I need to use convolutional + global average pooling / spatial pyramid pooling layers before the input to the LSTM, so the output blob will be [batch_size, feature_map_height, feature_map_width, feature_map_channels]. How should that blob be reshaped to be used as input to the LSTM? Like [batch_size, feature_map_width, feature_map_height*feature_map_channels]? Can we reshape it to a single row like [batch_size, feature_map_width*feature_map_height*feature_map_channels]? That would be like a sequence of pixels, and we lose some spatial information; will it work?

Here is the definition of the input, but I'm not sure what [batch_size, max_stepsize, num_features] means in your case: https://github.com/igormq/ctc_tensorflow_example/blob/master/ctc_tensorflow_example.py#L90

And how does the output of the LSTM depend on the input size and max sequence length? https://github.com/igormq/ctc_tensorflow_example/blob/master/ctc_tensorflow_example.py#L110

BTW: Here are some examples using 'standard' approaches in Keras+Tensorflow, which I want to complement with RNN examples: https://github.com/mrgloom/Char-sequence-recognition

mrgloom commented 7 years ago

Seems related: https://gist.github.com/igormq/eff5b2196a52e89c61ea52515ed87c47

mrgloom commented 7 years ago

Some info is described here, but it's still not very clear to me:

https://stackoverflow.com/questions/38059247/using-tensorflows-connectionist-temporal-classification-ctc-implementation

So, at the input of the RNN we have something like [num_batch, max_time_step, num_features]. We use dynamic_rnn to perform the recurrent calculations given the input, outputting a tensor of shape [num_batch, max_time_step, num_hidden]. After that, we need to do an affine projection at each timestep with weight sharing, so we have to reshape to [num_batch*max_time_step, num_hidden], multiply by a weight matrix of shape [num_hidden, num_classes], add a bias, undo the reshape, and transpose (so we will have [max_time_step, num_batch, num_classes] for the ctc loss input), and this result will be the input of the ctc_loss function.
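The reshape / project / transpose pipeline above can be sketched in NumPy to check the shapes (all sizes here are made up, and random values stand in for the dynamic_rnn output):

```python
import numpy as np

num_batch, max_time, num_hidden, num_classes = 4, 75, 128, 28

# Stand-in for the dynamic_rnn output: [num_batch, max_time_step, num_hidden]
outputs = np.random.randn(num_batch, max_time, num_hidden).astype(np.float32)

# Shared affine projection, applied identically at every timestep:
W = np.random.randn(num_hidden, num_classes).astype(np.float32)
b = np.zeros(num_classes, dtype=np.float32)

flat = outputs.reshape(num_batch * max_time, num_hidden)  # merge batch and time
logits = flat @ W + b                                     # affine projection
logits = logits.reshape(num_batch, max_time, num_classes) # undo the reshape

# ctc_loss expects time-major input: [max_time_step, num_batch, num_classes]
logits = logits.transpose(1, 0, 2)
print(logits.shape)  # (75, 4, 28)
```

Because the same W and b are used for every row of the flattened tensor, the weights are shared across all timesteps, which is exactly the point of the reshape.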

igormq commented 6 years ago

Hi @mrgloom, you can use either width or height as your "time dimension". Using the width, you will perform a row-wise scan; otherwise, you will perform a column-wise scan. Also, you can apply conv layers before the LSTM network, followed by a Global Average Pooling, returning a tensor with shape [batch_size, feature_map_height, feature_map_width].
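A minimal NumPy sketch of the collapse described above (shapes are illustrative): averaging the conv feature maps over the channel axis leaves [batch_size, feature_map_height, feature_map_width], which can then be fed to the LSTM with width as the time axis.

```python
import numpy as np

# Stand-in for a conv-stack output: [batch, fm_height, fm_width, fm_channels]
fmaps = np.random.randn(4, 8, 25, 64).astype(np.float32)

# Average over the channel axis: [B, fH, fW, fC] -> [B, fH, fW]
pooled = fmaps.mean(axis=3)

# Treat width as time (row-wise scan): [B, fW, fH]
seq = pooled.transpose(0, 2, 1)
print(seq.shape)  # (4, 25, 8)
```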