mindee / doctr

docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
https://mindee.github.io/doctr/
Apache License 2.0

Request for detailed explanation on a few points #411

Closed vishveshtrivedi closed 3 years ago

vishveshtrivedi commented 3 years ago

Hello @fg-mindee & @charlesmindee, the end-to-end OCR library docTR that you have developed is fantastic. It is very easy to use and gives very good results. I am currently working on an OCR-related project, have run docTR on sample images, and got good results. However, I have a few questions, listed below, and would be grateful for explanations on them.

Questions:

1. Which db_resnet50 model are you using: one pretrained on the SynthText dataset, or one trained on real-world data as mentioned in the paper?
2. How can we fine-tune the model?
3. Is there any way to get the detection output without postprocessing?
4. How can we improve detection accuracy?
5. When will your private dataset be available?
6. How much training data do we need to get good results on our dataset (forms, invoices, receipts, etc.)?
7. You also mention that, to train the model, each JSON file must contain 3 lists of boxes. Why are 3 lists of boxes needed for a single image?

fg-mindee commented 3 years ago

Hi @vishveshtrivedi :wave:

Glad to hear that the library is useful! Here are some answers to your questions:

  1. I'm not sure I perfectly understand your question, but the trained parameters of docTR come from real-world datasets. No pretraining on SynthText has been used yet, and we evaluate the models on publicly available datasets plus our private ones.
  2. You can check the README of each training script for further information, but passing the --pretrained flag will start your training from the version we trained. You will need to convert your dataset to the format mentioned in the README for it to work.
  3. If you are talking about raw logits, yes, you can pass return_model_output=True as an argument of your call to the model (see the sketch after this list, and cf. https://github.com/mindee/doctr/blob/main/doctr/models/detection/differentiable_binarization/pytorch.py#L174)
  4. That's an open question, but speaking for this project, we're going to extend the augmentations that we use and move to backbones pretrained on character classification.
  5. Well, if you're talking about the private test & training sets, I'm not sure they will! Real-world datasets include sensitive information that we cannot disclose. However, we have started paving the way for synthetic datasets (cf. #414 & #262), and those will most likely be shared publicly.
  6. Hard to tell, but for the detection part we get good results starting from 100k samples. For the recognition part it depends on your character distribution, but in our experience, starting from 1M. Please note that those figures are meant to get a generic model; if you have a more specific use case, I would argue you could decrease all of those!
  7. We're going to change this, but those are for box quality (boxes_1 being very confident, boxes_3 not so confident). We're most likely going to switch to a single list of boxes plus a list of confidence flags :+1:
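For point 3, here is a minimal sketch of what I mean (assuming the PyTorch backend; the flags and output keys are the ones from the file linked above, so double-check them against your installed version):

```python
import torch

from doctr.models import detection_predictor

# Load the pretrained text detector discussed above
predictor = detection_predictor(arch="db_resnet50", pretrained=True)
model = predictor.model  # the underlying differentiable binarization module

# Dummy page batch (N, C, H, W); replace with your own preprocessed tensor
pages = torch.rand(1, 3, 1024, 1024)

model.eval()
with torch.no_grad():
    # return_preds may be named return_boxes in some older releases
    out = model(pages, return_model_output=True, return_preds=True)

print(out["out_map"].shape)  # raw segmentation map, before any postprocessing
print(out["preds"])          # postprocessed boxes, i.e. what the recognition stage consumes
```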

Hope this helps, let me know if I misunderstood something :)

vishveshtrivedi commented 3 years ago

Thanks for the quick reply. Replying to your answers:

1. I asked the first question because in the DB paper the authors mention pretraining on SynthText and then evaluating/fine-tuning on real-world datasets. But as I understand your answer, you trained on real-world data from scratch and it gives good results!
2. I read https://github.com/mindee/doctr/tree/main/references/detection. I wanted to know how much data & computing power we need in order to fine-tune the model (either on a font or a layout) and achieve good results.
3. In my use case I found that some words were detected originally but then filtered out by the postprocessing. That means the postprocessor changes the output and hence the input to the recognition. Also, on line https://github.com/mindee/doctr/blob/350a96101d482be3c70b488393c362f623beff78/doctr/models/detection/differentiable_binarization/pytorch.py#L179, as I understand it, out["preds"] is what gets used by the recognition model. Is there any way we can skip this and pass only the original detection output to the recognition model?

A few other questions:

1. Can you explain what happens in the postprocessing? I noticed some filtering based on probabilities and size, but what else is done to the detected boxes that changes them so much?
2. What is the difference between out["out_map"] & out["preds"]? I am asking because if it were possible to use the probability map directly as input to the recognition, we could easily skip the postprocessing step for our use case.

Thanks a lot!!

fg-mindee commented 3 years ago

> I asked the first question because in the DB paper the authors mention pretraining on SynthText and then evaluating/fine-tuning on real-world datasets. But as I understand your answer, you trained on real-world data from scratch and it gives good results!

Correct!

> I read https://github.com/mindee/doctr/tree/main/references/detection. I wanted to know how much data & computing power we need in order to fine-tune the model (either on a font or a layout) and achieve good results.

Regarding the quantity of data, cf. my answer to question 6 ;) For computing power, let's talk in GPU VRAM: you'll need about 10 GB to fit a decent batch size.

> In my use case I found that some words were detected originally but then filtered out by the postprocessing. That means the postprocessor changes the output and hence the input to the recognition. Also, on line https://github.com/mindee/doctr/blob/350a96101d482be3c70b488393c362f623beff78/doctr/models/detection/differentiable_binarization/pytorch.py#L179, as I understand it, out["preds"] is what gets used by the recognition model. Is there any way we can skip this and pass only the original detection output to the recognition model?

For now, it's true that users cannot change the threshold for postprocessing. But once your model is instantiated, you can always do:

model.postprocessor.box_thresh = your_new_threshold 

Setting a lower value will keep more boxes (see the sketch below). And if you want to pass the raw output, as I said, you can use return_model_output=True :ok_hand:
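Concretely, something along these lines should work (a sketch assuming the end-to-end ocr_predictor, where the detection model is reachable under det_predictor; "sample_page.jpg" is just a placeholder):

```python
from doctr.io import DocumentFile  # doctr.documents in older releases
from doctr.models import ocr_predictor

predictor = ocr_predictor(det_arch="db_resnet50", reco_arch="crnn_vgg16_bn", pretrained=True)

# Lower the box filtering threshold so that fewer candidate boxes are discarded
predictor.det_predictor.model.postprocessor.box_thresh = 0.1

doc = DocumentFile.from_images("sample_page.jpg")
result = predictor(doc)
print(result.export())
```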

> Can you explain what happens in the postprocessing? I noticed some filtering based on probabilities and size, but what else is done to the detected boxes that changes them so much?

It's the postprocessing from the paper; if you want to check the code:
https://github.com/mindee/doctr/blob/main/doctr/models/detection/core.py#L85-L116
https://github.com/mindee/doctr/blob/main/doctr/models/detection/differentiable_binarization/base.py#L79-L137

> What is the difference between out["out_map"] & out["preds"]? I am asking because if it were possible to use the probability map directly as input to the recognition, we could easily skip the postprocessing step for our use case.

preds are the postprocessed results (the boxes), while out_map is the raw logits tensor coming out of the model (a segmentation map of sorts) :)

Let me know if that isn't very clear!

vishveshtrivedi commented 3 years ago

Thanks a lot for the reply!! One final question: why have you not used resnet101 & resnet152?

fg-mindee commented 3 years ago

Simply put: for now we favour lighter models. If at some point we see that we need bigger architectures, we'll give them a try :+1:

vishveshtrivedi commented 3 years ago

Thanks!!!!

vishveshtrivedi commented 3 years ago

Hi @fg-mindee, I have one doubt. To improve detection, I changed the bin_thresh value from 0.3 to 0.2. This improved my detection but it changed my recognition: values that were recognized correctly before now have some errors after the change. Is there any connection between bin_thresh and recognition?

charlesmindee commented 3 years ago

Hi @vishveshtrivedi,

The bin_thresh value is used to binarize the raw segmentation map: if you lower it, you will most likely detect more words, but the risk is to lose the space between words. It should lead to a higher recall and a lower precision for the detection task. The recognition task does not use bin_thresh, but the boxes detected with bin_thresh are what gets fed to the recognizer, so in a way it is related. If bin_thresh is too low you will end up with (too) large boxes, possibly with more than 1 word in each box. This can lead to a bad final recognition result because we don't have spaces in our vocabularies, so our models can only deal with 1 word per box.
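To illustrate with a toy example (not docTR code, just the idea): bin_thresh is the cutoff applied to the probability map before connected regions are turned into boxes, so lowering it can merge neighbouring words into a single region:

```python
import numpy as np

# Toy probability map for two neighbouring words separated by a low-confidence gap
prob_map = np.array([
    [0.05, 0.25, 0.80, 0.85, 0.22, 0.78, 0.81],
    [0.04, 0.28, 0.82, 0.90, 0.26, 0.79, 0.84],
])

# bin_thresh = 0.3 keeps the gap, so the two words stay in separate regions
print((prob_map > 0.3).astype(int))
# bin_thresh = 0.2 fills the gap (0.22-0.28), merging both words into one box
print((prob_map > 0.2).astype(int))
```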

Thank you and have a good day !

vishveshtrivedi commented 3 years ago

Hi @charlesmindee,

1. The metric I use for detection is accuracy, i.e. accuracy = detected words in an image / total words in an image.
2. Consider the example of a date with ground truth 04/14/2020. When bin_thresh was 0.3 the date was recognized as 04/14/2020; when I changed it to 0.2 it was recognized as 0414#2020.

Thanks a lot !

charlesmindee commented 3 years ago

For the first point, it seems quite logical to detect more words when you decrease the threshold, as explained above. For the second point it is quite weird; maybe we should plot the cropped box (the input of the text recognition model) for this word for both thresholds to see the difference between the 2 pictures. Do you think you can do that? Thanks!

vishveshtrivedi commented 3 years ago

Hi @charlesmindee, for the second point, I have compared the cropped images for the two thresholds. After changing the threshold to 0.2, in one case the box of the date in question merged with an adjacent word box. This harmed the recognition results of both fields, which is understandable. However, in another case the box of the date in question only expanded slightly along the y-axis (vertically), and the recognition results were as mentioned above (0414#2020). How does such a slight change in box size harm the recognition so much? Also, I want to ask how you arrived at the bin_thresh value of 0.3. By some experiments on the test dataset?

charlesmindee commented 3 years ago

Hi, for the first case the result is logical as you mentioned; for the second case it is quite weird. I must admit I can't really explain it: it is strange because the model did recognize the right digits but replaced the second slash and removed the first one. Which recognition model did you use?

For the value of bin_thresh, it is an empirical choice: the paper uses a 0.2 threshold for text in the wild (without mentioning how this parameter was obtained!), but as it turns out 0.3 seems to work well for us on document text recognition. The interesting thing is that this parameter only influences the post-processing and does not interact with the neural net during the training process, so we can adjust it afterwards, and you can fine-tune it (very carefully, as you have seen it is very sensitive!) to adapt it to your own use case, without re-training the model.
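For instance, you can sweep it cheaply after loading the model, along these lines (a sketch; my_validation_pages stands for your own list of page images and evaluate for your own metric, and the attribute path assumes the ocr_predictor layout):

```python
from doctr.models import ocr_predictor

predictor = ocr_predictor(det_arch="db_resnet50", reco_arch="crnn_vgg16_bn", pretrained=True)

best = None
for thresh in (0.20, 0.25, 0.30, 0.35):
    # Only the post-processing changes, no re-training needed
    predictor.det_predictor.model.postprocessor.bin_thresh = thresh
    result = predictor(my_validation_pages)  # list of numpy arrays (H, W, 3)
    score = evaluate(result.export())        # your own scoring against ground truth
    if best is None or score > best[1]:
        best = (thresh, score)

print("best bin_thresh:", best)
```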

vishveshtrivedi commented 3 years ago

Hi @charlesmindee, I have used crnn_vgg16_bn as a recognition model & db_resnet50 for detection.

charlesmindee commented 3 years ago

Hi @vishveshtrivedi, maybe you can try with our master model to see if the same glitch happens; for now I cannot really explain it, but I will investigate on my side. Have a nice day!

vishveshtrivedi commented 3 years ago

Hi @charlesmindee, I will try the master model.

Also, an interesting thing I noticed was that I changed the image DPI from 500 to 600 (we are converting PDF to images) and in a few images the recognition improved a lot. Is there some reason for recognition being so sensitive to DPI?

Finally, would it be recommended to preprocess the input image in a certain way (binarized, greyscale, deblurred, etc.)?

Thank you so much!

charlesmindee commented 3 years ago

Hi @vishveshtrivedi,

If you increase the DPI, you will have higher-resolution images from your PDF pages, and this can help the recognition model distinguish letters written in small fonts or slightly blurred lines which can't be resolved at a lower resolution. However, 500 DPI is already a huge resolution (4134 x 5846 pixels for an A4 page). Are you feeding the model with a document from_pdf, or do you convert your PDF to images before creating your document object? We almost exclusively work with a DPI of 144, and it seems to be enough for A4 PDF pages.
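For reference, the pixel size is simply the page size in inches times the DPI:

```python
# A4 is roughly 8.27 x 11.69 inches
for dpi in (144, 300, 500, 600):
    print(dpi, "DPI ->", round(8.27 * dpi), "x", round(11.69 * dpi), "pixels")
# 144 DPI -> 1191 x 1683, 300 DPI -> 2481 x 3507, 500 DPI -> 4135 x 5845, 600 DPI -> 4962 x 7014
```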

You don't need to binarize/greyscale/... or otherwise preprocess your images before feeding the model; it should work fine! Of course, if you work with particularly noisy or blurred documents, preprocessing the data should only improve your performance.

Have a nice day!

vishveshtrivedi commented 3 years ago

Hi @charlesmindee, I am converting the PDF to images with the pdf2image library and then feeding the images to the model via a document created with from_images().

Thanks a lot !

charlesmindee commented 3 years ago

OK, you can also use our PDF converter by instantiating a document with from_pdf(); it will use a 144 DPI rate for the conversion (see the sketch below). I am moving this to a GitHub discussion (this will close the issue but open a discussion); it seems more appropriate for continuing to chat about technical aspects of the library!
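Roughly like this (a sketch; depending on your docTR version, DocumentFile lives in doctr.io or doctr.documents, and older releases need .as_images() after from_pdf):

```python
from doctr.io import DocumentFile  # doctr.documents in older releases
from doctr.models import ocr_predictor

predictor = ocr_predictor(pretrained=True)

# Let docTR rasterize the PDF itself (144 DPI by default)
doc = DocumentFile.from_pdf("sample.pdf")  # older releases: DocumentFile.from_pdf("sample.pdf").as_images()
result = predictor(doc)

# Alternatively, keep pdf2image and pass the pages as numpy arrays
# from pdf2image import convert_from_path
# import numpy as np
# pages = [np.asarray(img) for img in convert_from_path("sample.pdf", dpi=144)]
# result = predictor(pages)
```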

Thanks and don't hesitate to come back with new questions !