githubharald / SimpleHTR

Handwritten Text Recognition (HTR) system implemented with TensorFlow.
https://towardsdatascience.com/2326a3487cd5
MIT License

No accuracy, GPU option, char list fixed number #105

Closed rzamarefat closed 3 years ago

rzamarefat commented 3 years ago

I am currently trying to train the model for detecting handwritten words in Farsi, a language which is cursive by nature. After some amount of training I get an accuracy of exactly 0.00, and all the predictions stay the same from beginning to end. This means no training is happening at all. I have checked every step I have taken and have no idea what I am doing wrong. Is it possible to train this model for another language at all?

My second question is: how can I make the training run on the GPU? I have a GPU and it works fine for other training procedures on my machine, but this code runs only on the CPU. Is there a flag or something for this?

My last question is about the size of the char list. According to the decoder for English, the length of the char set must be 79, and if this condition is not satisfied TensorFlow throws an error which, as far as I traced it, relates to this. How can I make this work for any number of chars?

githubharald commented 3 years ago

I am currently trying to train the model for detecting handwritten words in Farsi, a language which is cursive by nature. After some amount of training I get an accuracy of exactly 0.00, and all the predictions stay the same from beginning to end. This means no training is happening at all. I have checked every step I have taken and have no idea what I am doing wrong. Is it possible to train this model for another language at all?

There should not be a problem training the model for other languages. Some languages might work better, others worse, but 0% accuracy sounds like something went wrong. It might be that either you do not feed images into the model, or that something went wrong with the ground-truth text (I never tried it with non-Latin chars, so there might be a problem with Farsi).

Checklist: add the following debug lines to get_next in the data loader and check that the ground-truth text and the image of the first sample of a batch look correct:

    def get_next(self) -> Batch:
        """Get next element."""
        batch_range = range(self.curr_idx, min(self.curr_idx + self.batch_size, len(self.samples)))

        imgs = [self._get_img(i) for i in batch_range]
        gt_texts = [self.samples[i].gt_text for i in batch_range]

        # TODO: add these lines to inspect the first sample of the batch
        print(gt_texts[0])  # the ground-truth text should describe the image below
        import matplotlib.pyplot as plt
        plt.imshow(imgs[0])  # the image should show readable handwritten text
        plt.show()

        self.curr_idx += self.batch_size
        return Batch(imgs, gt_texts, len(imgs))

My second question is: how can I make the training run on the GPU? I have a GPU and it works fine for other training procedures on my machine, but this code runs only on the CPU. Is there a flag or something for this?

Uninstall the package tensorflow, then install the package tensorflow-gpu.
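
To verify that TensorFlow then actually sees the GPU, a quick check (the second call is the older TF 1.x variant, kept as a comment in case you are on an older version):

import tensorflow as tf

# TF 2.x: list the GPUs TensorFlow can see (empty list means CPU only)
print(tf.config.list_physical_devices('GPU'))

# TF 1.x equivalent:
# print(tf.test.is_gpu_available())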

My last question is about the size of the char list. According to the decoder for English, the length of the char set must be 79, and if this condition is not satisfied TensorFlow throws an error which, as far as I traced it, relates to this. How can I make this work for any number of chars?

when you retrain the model, it automatically uses as many chars as there are in the training set. So it is all about getting the training set right.
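
In other words, the character list is simply derived from the training labels; a minimal sketch of the idea (not the repo's exact code, the labels below are placeholders):

# collect the set of characters that occur in the ground-truth texts of the
# training samples and use that as the model's character list
gt_texts = ['سلام', 'hello', 'world']  # placeholder labels
char_list = sorted(set(''.join(gt_texts)))
print(char_list, len(char_list))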

rzamarefat commented 3 years ago

Thank you for your helpful response. I think I found where the problem lies: I was feeding the images to the model in the wrong way.

rzamarefat commented 3 years ago

Dear Harald, I have been busy debugging the code for the last couple of days, and as I told you, I found that the way I feed the images to the model is wrong. More precisely, when I log my image in the following code snippet, I get None for the image.

    def _get_img(self, i: int) -> np.ndarray:
        if self.fast:
            with self.env.begin() as txn:
                basename = Path(self.samples[i].file_path).basename()
                data = txn.get(basename.encode("ascii"))
                img = pickle.loads(data)
        else:
            print('self.samples[i].file_path:', self.samples[i].file_path)
            # img = cv2.imread(self.samples[i].file_path, cv2.IMREAD_GRAYSCALE)
            print(type(self.samples[i].file_path))
            img = cv2.imread(str(self.samples[i].file_path))
            print('img-> ', img)
            plt.imshow(img)
            plt.show()

I have checked "self.samples[i].file_path" several times and I am completely sure that the path to the images is correct. But I cannot understand why OpenCV reads the image as None. So the error I then get is the following, which is natural given that the image is None: Image data of dtype object cannot be converted to float. Any ideas how to fix this?
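
For reference, a minimal guard that makes this kind of failure explicit (illustrative sketch; file_path is a placeholder for self.samples[i].file_path):

from pathlib import Path
import cv2

file_path = 'data/img/word_001.png'  # placeholder path

# cv2.imread does not raise on failure, it silently returns None,
# e.g. when the file does not exist or cannot be decoded
assert Path(file_path).exists(), f'file not found: {file_path}'
img = cv2.imread(str(file_path), cv2.IMREAD_GRAYSCALE)
assert img is not None, f'cv2.imread could not decode: {file_path}'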

githubharald commented 3 years ago
rzamarefat commented 3 years ago

I solved the issue and it is training now. Thank you very much for your reply. Could you please explain what "--fast" does? I cannot deduce from the code itself what it does with the batches.

githubharald commented 3 years ago

see "Fast image loading" in README.

rzamarefat commented 3 years ago

Hi dear Harald. I have trained the model on the Farsi language and the accuracy is a bit more than 93%, which works well for me. Now, in the inference stage, when I run main.py it throws the following error: Cannot feed value of shape (20, 1, 79) for Tensor 'Placeholder_5:0', which has shape '(None, None, 120)'. Please note that 79 is the number of all chars in charList. Could you please give me a hint how to solve this?

githubharald commented 3 years ago

Looks like a mismatch between the number of chars and the output size of the network. Can you train (just for one epoch) and add print statements in main.py after the line:

char_list = loader.char_list
print(char_list) #add
print(len(char_list)) #add

and then, do inference, but add a line before model creation:

char_list = list(open(FilePaths.fn_char_list).read()) #add
print(char_list) #add
print(len(char_list)) #add
model = Model(list(open(FilePaths.fn_char_list).read()), decoder_type, must_restore=True, dump=args.dump)
infer(model, args.img_file)

then, post the outputs here. it might be something related to unicode, but I'm not sure at the moment what it is.

rzamarefat commented 3 years ago

I have done what you said and the following is the output: the char_list: [' ', '!', '"', '%', "'", '.', '?', '@', 'آ', 'ئ', 'ا', 'ب', 'ب', '\u200d', 'ت', 'ت', '\u200d', 'ث', 'ث', '\u200d', 'ج', 'ج', '\u200d', 'ح', 'ح', '\u200d', 'خ', 'خ', '\u200d', 'د', 'ذ', 'ر', 'ز', 'س', 'س', '\u200d', 'ش', 'ش', '\u200d', 'ص', 'ص', '\u200d', 'ض', 'ض', '\u200d', 'ط', 'ظ', 'ع', 'ع', '\u200d', 'غ', 'غ', '\u200d', 'ف', 'ف', '\u200d', 'ق', 'ق', '\u200d', 'ل', 'م', 'م', '\u200d', 'ن', 'ن', '\u200d', 'ه', 'ه', '\u200d', 'و', 'پ', 'پ', '\u200d', 'چ', 'چ', '\u200d', 'ژ', 'ک', 'ک', '\u200d', 'گ', 'گ', '\u200d', 'ی', 'ی', '\u200d', '\u200d', 'ب', '\u200d', 'ب', '\u200d', '\u200d', 'ت', '\u200d', '\u200d', 'ث', '\u200d', '\u200d', 'ج', '\u200d', 'ح', '\u200d', 'خ', '\u200d', 'د', '\u200d', 'ذ', '\u200d', 'ر', '\u200d', 'ز', '\u200d', 'ف', '\u200d', '\u200d', 'ق', '\u200d', '\u200d', 'ه']

Please note that '\u200d' (the zero-width joiner) is a special character on the Farsi keyboard layout, used as a half space instead of a full space.

len(char_list) --> 119

These are the outputs. Please note that when I set the char_list in data_loader to the exact characters that I have in my dataset, its length is 35, which makes the algorithm throw the following error: Conv2DCustomBackpropFilter: filter and out_backprop must have the same out_depth. To get around this error I padded the character list with some fake, dummy chars to reach 79 (the exact length of the char list for your English dataset). This works around the problem, but it is not good practice. To give you the big picture, here are the issues I am facing:

  1. My char list length is 35, and this throws the following error (so I have to add some dummy chars): Conv2DCustomBackpropFilter: filter and out_backprop must have the same out_depth
  2. After adding dummy chars and running the training, it reaches an accuracy of 94% based on the prediction logs of the batches after each epoch. But now I am facing the aforementioned problem: Cannot feed value of shape (20, 1, 79) for Tensor 'Placeholder_5:0', which has shape '(None, None, 120)'

githubharald commented 3 years ago

Difficult to tell from the outside. You have to debug and see why there are only 35 chars in the model, while there are 119 chars in the char_list.txt. Also try using utf8 encoding (both for writing and reading the char_list file, see e.g. here).
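
A minimal sketch of what that means (the file name and character list below are illustrative; use whatever path FilePaths.fn_char_list points to):

char_list = ['آ', 'ا', 'ب']  # placeholder character list

# write the character list with an explicit encoding ...
with open('charList.txt', 'w', encoding='utf-8') as f:
    f.write(''.join(char_list))

# ... and read it back with the same encoding
with open('charList.txt', encoding='utf-8') as f:
    char_list = list(f.read())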

rzamarefat commented 3 years ago

Dear Harald, I have debugged the code and found the solution. But this time I have a general question. I have trained the model for around 100 epochs. My dataset contains a large number of synthetic text images (around 500k), as shown in the attached image. These are images to which a bunch of simple data-augmentation techniques are applied, with the aim of making the network concentrate only on the text, not on random variations in the background or the style/font of the text. The training procedure runs as expected and the loss goes down. The error and accuracy on the validation set, which are logged during training, show that learning does happen in a trustworthy way.

But my main question is this: when I ask the system to predict the text of a given one-word image (taken from, say, the Internet, or a cropped movie frame containing only one word), the performance drops drastically. Please note that my dataset covers virtually all common words of the target language, with many random variations in the background, size and position of the text inside the frame. With that being said, I still get very poor results on random data. Of course the chance of overfitting is high, but I want to ask whether this algorithm can, in essence, be trained to predict such images. It is worth saying that my ultimate aim is OCR on images that are neither scene images nor classic OCR-like data (black text on a blank white background, like document images). Is there any way to achieve my goal using this code, or is it impossible no matter what modifications are made?

githubharald commented 3 years ago

So you want to recognize machine-printed text and not handwritten text? That should be possible. I would start by improving the synthetic dataset. One thing is that the images get resized to a height of 32px; for your sample image, this means it ends up looking like the attached (resized) image.

The model itself is capable of learning this task. You can still increase the model capacity if needed (number of channels, number of layers, bigger input image size). But for now I would focus on the dataset: make it more realistic and have the words more in focus.

rzamarefat commented 3 years ago

1. I really thank you for your helpful guidance; this means a lot to me. Based on your algorithm, especially the preprocessing module, I thought that the system itself resizes the input images to 32px in height. Isn't that right? (Does your sentence "the model itself is capable of learning this task" mean exactly what I said?)
2. Furthermore, about making the dataset more realistic: what do you mean exactly? Can you give me a hint about what kind of techniques I can use? Because "realistic" makes me think of scene-text images, which is a completely different task in computer vision.
3. How can I make the words more in focus? Can you instruct me a bit?

githubharald commented 3 years ago

Yes, it resizes the images to 32px, but then the text is hardly readable any more. You should remove as much of the border as possible, so that almost only the word remains inside the image. Compare the two attached images: the first one is the original image resized to height 32, the second one is cropped first and then resized to height 32. The second one is much easier to read.
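
A minimal sketch of that kind of preprocessing (illustrative, assuming an 8-bit grayscale image with dark text on a light background and at least some text pixels present):

import cv2

def crop_and_resize(img, target_height=32):
    # find the text pixels by Otsu thresholding, then crop to their bounding box
    _, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    x, y, w, h = cv2.boundingRect(cv2.findNonZero(thresh))
    cropped = img[y:y + h, x:x + w]
    # resize to a fixed height while keeping the aspect ratio
    scale = target_height / cropped.shape[0]
    new_width = max(1, int(cropped.shape[1] * scale))
    return cv2.resize(cropped, (new_width, target_height))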

If you have the code that creates the synthetic dataset images, then it should be easy to do. If not, well, then it is not so easy, and in the worst case you would have to do it manually.

rzamarefat commented 3 years ago

Actually, I have written the code for generating the synthetic data myself, so it is easy for me to implement such a thing. But in case I want to adapt your algorithm to accept larger images (in height), could you please tell me which parts of the code should be modified? Is it just the model.py module (in the CNN setup, because it is the first component of the NN)?

githubharald commented 3 years ago

Yes, the CNN setup code is the relevant part. And somewhere in the code the height of 32 is hardcoded, which you also have to change.
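
For illustration, the constraint to keep in mind (a hypothetical sketch; the stride values are assumptions about the CNN's pooling configuration, not copied from the repo): the vertical pooling strides have to reduce the input height all the way down to 1 before the features are handed to the RNN, so a taller input needs an extra pooling step.

# stride values here are illustrative, not the repo's exact configuration
pool_strides = [(2, 2), (2, 2), (1, 2), (1, 2), (1, 2)]

def output_height(input_height, strides):
    h = input_height
    for _, stride_y in strides:
        h //= stride_y
    return h

print(output_height(32, pool_strides))             # 1: works for height 32
print(output_height(64, pool_strides + [(1, 2)]))  # 1: height 64 needs one more pooling step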

rzamarefat commented 3 years ago

Thank you for the response. Regarding your suggestion to make the height of the images 32 with varying width, and also to make the text inside the images more focused and bold: I have implemented this, and now the model only recognizes the first character of a word, and the accuracy has dropped drastically.

sctrueew commented 2 years ago

@rzamarefat Hi,

Would you be able to share your dataset and the model?

Thanks in advance

rzamarefat commented 2 years ago

@sctrueew Hi,

Thank you for your interest in our work. Unfortunately, I can't share any of these due to NDA obligations with the company I work for.