ilovin / lstm_ctc_ocr

Use CTC + tensorflow to OCR
https://ilovin.github.io/2017-04-06/tensorflow-lstm-ctc-ocr/

How to deal with different size of images? #9

Closed Naruto-Sasuke closed 6 years ago

Naruto-Sasuke commented 7 years ago

I have lots of segmentation images of different sizes. Must I rescale the images to the same size? It would be much better if training could be done without scaling.

ilovin commented 7 years ago

I have an idea: resize the images to the same height and pad them to the same width, then in seq_len set each time_step to the real width (the default is 160 for every element of the batch). Please leave a comment on whether it works if you give it a try.
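A rough sketch of that idea (not the code in this repo; image_height, the batch list, and the grayscale layout are just assumptions for illustration):

import cv2
import numpy as np

def pad_batch(images, image_height=60):
    # resize each image to a fixed height while keeping the aspect ratio
    resized = []
    for im in images:
        scale = image_height / float(im.shape[0])
        new_w = max(1, int(round(im.shape[1] * scale)))
        resized.append(cv2.resize(im, (new_w, image_height)))
    # zero-pad every image on the right to the widest one in the batch
    max_w = max(im.shape[1] for im in resized)
    batch = np.zeros((len(resized), image_height, max_w), dtype=np.float32)
    seq_len = np.zeros(len(resized), dtype=np.int32)
    for i, im in enumerate(resized):
        w = im.shape[1]
        batch[i, :, :w] = im
        seq_len[i] = w  # the real width = the real number of time steps
    return batch, seq_len

seq_len would then replace the constant 160 that is currently fed for every element of the batch.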

Naruto-Sasuke commented 7 years ago

I processed my dataset as you suggested, but I have some questions about your code. I scale my images to the same height:

im = cv2.resize(im, (im.shape[1], image_height))  # cv2.resize takes (width, height): keep width, fix height
...
batch_inputs, batch_seq_len = pad_input_sequences(np.array(image_batch))  # pad widths to a common length

However, self.inputs = tf.placeholder(tf.float32, [None, None, num_features]) in lstm_ocr.py uses num_features = utils.num_features, while in utils.py it looks like this:

channel = 1
image_width = 100  # not useful in training
image_height = 300
num_features = image_height * channel

Q1. In infer.py:

im = cv2.resize(im, (utils.image_width, utils.image_height))

It uses utils.image_width, but the widths vary quite a lot across images. How do you deal with that?

Q2. It seems that you have commented out the shuffle code and the data_augment functions. I only have a few thousand images. Can I use this code to do something beneficial for training?

ilovin commented 7 years ago

Q1. Two solutions:
S1: use batch_size = 1.
S2.1: resize each image to the same height (keeping the aspect ratio); the widths now differ, so pad them with zeros to the same width. Note that this common width can be the max width within each batch.
S2.2: if you want to do it better, make the LSTM ignore the padded zeros (see the sketch below); you may read this.
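For S2.2, a minimal TF 1.x sketch of feeding the real widths as sequence_length so the recurrent state stops updating past the padded zeros (cell size and names are placeholders, not the repo's exact graph):

import tensorflow as tf

num_features = 60   # image_height * channel, placeholder value
num_hidden = 128

inputs = tf.placeholder(tf.float32, [None, None, num_features])  # (batch, time, features)
seq_len = tf.placeholder(tf.int32, [None])                       # real width of each image

cell = tf.contrib.rnn.LSTMCell(num_hidden)
# dynamic_rnn stops stepping example i after seq_len[i] frames,
# so the zero-padded columns no longer influence the final state
outputs, _ = tf.nn.dynamic_rnn(cell, inputs, sequence_length=seq_len, dtype=tf.float32)

The same seq_len is also what tf.nn.ctc_loss takes as its sequence_length argument.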

Q2. In the master branch I already use np.random.permutation to shuffle the batch. Of course you can try data_augment, but do not push it too "hard" (it may make the network harder to converge).
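For reference, shuffling with np.random.permutation at the start of every epoch looks roughly like this (illustrative only, not the exact loop in the repo):

import numpy as np

def iterate_epoch(samples, batch_size):
    # draw a fresh random order each epoch, then slice it into batches
    order = np.random.permutation(len(samples))
    for start in range(0, len(samples), batch_size):
        idx = order[start:start + batch_size]
        yield [samples[i] for i in idx]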

QaisarRajput commented 7 years ago

I used another naive approach to solve this, but I am thinking of optimizing it and using bucketing to handle the varying sizes.

Current approach: I found the max-width image in my dataset and padded every image on the right with whitespace to that width, making all images the same width and height, then changed the width and height parameters in config.py. This is a naive approach and may not be applicable in every case; it is also computationally expensive.

Analysis: I like the idea of pad_input_sequences, but that would only help in the CNN part; after that the LSTM will still see padded feature sequences. I have come across the concept of bucketing, mainly bucket_by_sequence_length, which could optimize this even further.

Question: Any suggestions about this approach? I am not sure how time_step should work for variable-size buckets in this case. Should it still be half of the image width? Currently you are using tf.train.shuffle_batch; how can we use tf.contrib.training.bucket_by_sequence_length instead?

def get_data(self, path, batch_size, num_epochs):
    filename_queue = tf.train.string_input_producer([path], num_epochs=num_epochs)
    image, label, label_len, time_step = read_tfrecord_and_decode_into_image_annotation_pair_tensors(filename_queue)
    image_batch, label_batch, label_len_batch, time_step_batch = tf.train.shuffle_batch(
        [image, label, label_len, time_step],
        batch_size=batch_size,
        capacity=9600,
        num_threads=4,
        min_after_dequeue=6400)
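A rough, untested sketch of swapping tf.train.shuffle_batch for tf.contrib.training.bucket_by_sequence_length (TF 1.x contrib API; the bucket boundaries and the assumption that time_step is a scalar int32 tensor are mine):

seq_len_batch, (image_batch, label_batch, label_len_batch, time_step_batch) = \
    tf.contrib.training.bucket_by_sequence_length(
        input_length=time_step,                 # groups examples of similar width together
        tensors=[image, label, label_len, time_step],
        batch_size=batch_size,
        bucket_boundaries=[80, 120, 160, 200],  # tune to your width distribution
        num_threads=4,
        capacity=32,
        dynamic_pad=True,                       # pad within a bucket instead of globally
        allow_smaller_final_batch=True)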
ilovin commented 7 years ago

I wrote a version that generates data on the fly; it pads the images to the same width within each batch. The code is in the beta branch. As for your question: since I use a CNN, the receptive field is enlarged, so time_step should be a little less than img_width/2 if you really want the network not to care about the padded area; if you just let the network learn to ignore that area, img_width//2 should be fine. I haven't used tf.contrib.training.bucket_by_sequence_length before; if you try it, leave a comment about the result.
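A minimal illustration of that arithmetic (assuming the CNN downsamples the width by a factor of 2; the exact divisor depends on the network):

import numpy as np

def compute_seq_len(real_widths, downsample=2):
    # per-image time steps fed to CTC: roughly real_width // downsample,
    # or slightly less if you want to exclude the padded tail entirely
    return np.array([w // downsample for w in real_widths], dtype=np.int32)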