Hello @fg-mindee & @charlesmindee,

The end-to-end OCR library doctr that you developed is fantastic. It is very easy to use and gives very good results. I am currently working on an OCR-related project; I ran doctr on sample images and got good results. However, I have a few questions, listed below, and would be grateful for explanations:

1) Which db_resnet50 model are you using? The one pretrained on the SynthText dataset, or the one evaluated on real-world datasets as mentioned in the paper?
2) How can we fine-tune the model?
3) Is there any way we can get the output after detection without postprocessing?
4) How can we improve the accuracy of detection?
5) When will your private dataset be available?
6) How much training data do we need to get good results on our dataset? (The dataset would be forms, invoices, receipts, etc.)
7) You have also mentioned that to train the model, each JSON file must contain 3 lists of boxes. Why are 3 lists of boxes needed for a single image?
Hi @vishveshtrivedi :wave:
Glad to hear that the library is useful! Here are some answers to your questions:
- The `--pretrained` flag will start your training from the version we trained. You will need to format your dataset to the format mentioned in the README for it to work.
- You can pass `return_model_output=True` as an argument of your call to the model! (cf. https://github.com/mindee/doctr/blob/main/doctr/models/detection/differentiable_binarization/pytorch.py#L174)

Hope this helps, let me know if I misunderstood something :)
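For reference, a minimal PyTorch sketch of fine-tuning from the pretrained detection checkpoint; the batch, box coordinates and hyperparameters below are hypothetical placeholders, and the actual training loop and target format are in the `references/detection` scripts:

```python
import numpy as np
import torch
from doctr.models import db_resnet50

# Start from the checkpoint trained by the maintainers instead of random weights
model = db_resnet50(pretrained=True).train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Hypothetical batch: 2 page images and one array of relative
# (xmin, ymin, xmax, ymax) word boxes per image
images = torch.rand(2, 3, 1024, 1024)
targets = [np.array([[0.1, 0.1, 0.4, 0.2]], dtype=np.float32) for _ in range(2)]

out = model(images, targets)  # the returned dict includes the training loss
optimizer.zero_grad()
out["loss"].backward()
optimizer.step()
```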
Thanks for the quick reply! Replying to your answers:

1) I asked the first question because in the DB paper the authors mention that they pretrained on SynthText and evaluated/fine-tuned on real-world datasets. But as I understand your answer, you trained on a real-world dataset from scratch and it gives good results!

2) I read the https://github.com/mindee/doctr/tree/main/references/detection README. I wanted to know how much data and computing power we need in order to fine-tune the model (either on a font or a layout) and achieve good results.

3) In my use case I found that some words were detected originally but then filtered out after postprocessing. That means the postprocessor changes the output and hence the input to the recognition. Also, on line https://github.com/mindee/doctr/blob/350a96101d482be3c70b488393c362f623beff78/doctr/models/detection/differentiable_binarization/pytorch.py#L179, as I understand it, out["preds"] is what gets used by the recognition model. Is there any way we can skip this and pass only the original detection output to the recognition model?

A few other questions:

1) Can you explain what happens in the postprocessing? I noticed some filtering based on probabilities and size, but what else is done to the detected boxes that changes them so much?

2) What is the difference between out["out_map"] & out["preds"]? I am asking because if it were possible to use the prob map directly as the input to recognition, we could easily skip the postprocessing step for our use case.
Thanks a lot!!
- I asked the first question because in the DB paper the authors mention that they pretrained on SynthText and evaluated/fine-tuned on real-world datasets. But as I understand your answer, you trained on a real-world dataset from scratch and it gives good results!
correct!
- I read the https://github.com/mindee/doctr/tree/main/references/detection README. I wanted to know how much data and computing power we need in order to fine-tune the model (either on a font or a layout) and achieve good results.
Regarding the quantity of data, cf. my answer to question 6 ;) For computing power, let's talk in GPU VRAM: you'll need about 10GB to have a decent batch size.
- In my use case I found that some words were detected originally but then filtered out after postprocessing. That means the postprocessor changes the output and hence the input to the recognition. Also, on line https://github.com/mindee/doctr/blob/350a96101d482be3c70b488393c362f623beff78/doctr/models/detection/differentiable_binarization/pytorch.py#L179, as I understand it, out["preds"] is what gets used by the recognition model. Is there any way we can skip this and pass only the original detection output to the recognition model?
For now, it's true that users cannot change the threshold for postprocessing. But once your model is instantiated, you can always do:

`model.postprocessor.box_thresh = your_new_threshold`

Setting a lower value will keep more boxes.
And if you want to pass the raw output, as I said, you can use `return_model_output=True` :ok_hand:
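Concretely, a minimal sketch (the threshold value and input batch are just examples to adapt to your own data):

```python
import torch
from doctr.models import db_resnet50

# Instantiate the detection model and relax the postprocessing filter
model = db_resnet50(pretrained=True).eval()
model.postprocessor.box_thresh = 0.05  # example value: lower keeps more boxes

# Ask for the raw output alongside the postprocessed predictions
with torch.no_grad():
    out = model(torch.rand(1, 3, 1024, 1024),  # placeholder page batch
                return_model_output=True, return_preds=True)
```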
- Can you explain what happens in the postprocessing? I noticed some filtering based on probabilities and size, but what else is done to the detected boxes that changes them so much?
It's the postprocessing from the paper; if you want to check the code:
https://github.com/mindee/doctr/blob/main/doctr/models/detection/core.py#L85-L116
https://github.com/mindee/doctr/blob/main/doctr/models/detection/differentiable_binarization/base.py#L79-L137
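In rough strokes, the linked code does something like the following simplified sketch; the real implementation also unclips each region with pyclipper and returns relative coordinates, and the function name and default values here are illustrative only:

```python
import cv2
import numpy as np

def simplified_db_postprocess(prob_map, bin_thresh=0.3, box_thresh=0.1, min_size=3):
    """Rough outline of the DB-style postprocessing."""
    # 1) binarize the probability map
    bin_map = (prob_map >= bin_thresh).astype(np.uint8)
    # 2) extract connected components as candidate word regions (OpenCV 4.x signature)
    contours, _ = cv2.findContours(bin_map, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for cnt in contours:
        x, y, w, h = cv2.boundingRect(cnt)
        # 3) drop degenerate regions
        if min(w, h) < min_size:
            continue
        # 4) score each box by the mean probability inside it and filter
        score = prob_map[y:y + h, x:x + w].mean()
        if score >= box_thresh:
            boxes.append((x, y, x + w, y + h, score))
    return boxes
```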
- What is the difference between out["out_map"] & out["preds"]? I am asking because if it were possible to use the prob map directly as the input to recognition, we could easily skip the postprocessing step for our use case.
predictions are postprocessed results (the boxes), while out_map is the logits tensor coming out of the model (a segmentation map of sorts) :)
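For instance, inspecting both keys on the `out` dict from the sketch above (shapes are indicative, they depend on the input size):

```python
# out["out_map"]: a dense tensor of raw scores over the page grid,
# i.e. the segmentation map before any thresholding
print(type(out["out_map"]), tuple(out["out_map"].shape))

# out["preds"]: the postprocessed result, a compact set of box
# coordinates per page rather than a per-pixel map
print(type(out["preds"]), len(out["preds"]))
```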
Let me know if that isn't very clear!
Thanks a lot for the reply!! One final question: why have you not used resnet101 & resnet152?
Simply put: if at some point we see that we need bigger architectures, we'll try! For now we favour lighter models :+1:
Thanks!!!!
Hi @fg-mindee, I have one doubt. To improve detection, I changed the bin_thresh value from 0.3 to 0.2; this improved my detection, but it changed my recognition. Values which were recognized correctly before have some errors after the change. Is there any connection between bin_thresh and recognition?
Hi @vishveshtrivedi,
The bin_thresh value is used to binarize the raw segmentation map: if you lower it, you will most likely detect more words, but the risk is to lose the space between words. It should lead to a higher recall and a lower precision for the detection task.

The recognition task does not use bin_thresh, but the boxes detected with bin_thresh in the detection task are the ones used to recognize words, so in a way it is related. If bin_thresh is too low, you will end up with (too) large boxes, possibly with more than one word in each box. This can lead to a bad final recognition result, because we don't have spaces in our vocabularies and thus our models can only deal with one word per box.
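To make the lost-space effect concrete, here is a tiny illustration with made-up probabilities along a horizontal strip crossing two neighbouring words:

```python
import numpy as np

# High probabilities inside each word, a shallow dip (0.25) in the gap between them
strip = np.array([0.9, 0.9, 0.25, 0.9, 0.9])

print(strip >= 0.3)  # [ True  True False  True  True] -> two separate components
print(strip >= 0.2)  # [ True  True  True  True  True] -> the gap disappears, one merged box
```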
Thank you and have a good day !
Hi @charlesmindee,
1) The metric used for detection is accuracy, i.e. accuracy = detected words in an image / total words in an image.
2) Consider the example of a date. Ground truth = 04/14/2020. When bin_thresh was 0.3, the date was read as 04/14/2020; when I changed it to 0.2, the date recognized was 0414#2020.
Thanks a lot !
For the first point, it seems quite logical to detect more words when you decrease the threshold, as explained above. For the second point, it is quite weird; maybe we should plot the cropped box (the input of the text recognition model) for this word at both thresholds to see the difference between the two pictures. Do you think you can do that? Thanks!
Hi @charlesmindee, for the second point, I have compared the cropped images for the two thresholds. After changing the threshold to 0.2, in one case the box of the date in question merged with an adjacent word box. This harmed the recognition results of both fields, which is understandable. However, in another case the box of the date in question expanded slightly along the y-axis (vertically), and the recognition results were as mentioned above (0414#2020). How does such a slight change in box size harm the recognition so much? Also, I want to ask how you arrived at the bin_thresh value of 0.3. Through experiments on the test dataset?
Hi, for the first case the result is logical, as you mentioned; the second case is quite weird. I must admit I can't really explain it: it is strange because the model did recognize the right digits but replaced the second slash and removed the first one. Which recognition model did you use? As for the value of bin_thresh, it is an empirical setting: the paper uses a 0.2 threshold for text in the wild (without mentioning how this parameter was obtained!), but as it turns out, 0.3 seems to work well for us for document text recognition. The interesting thing is that this parameter only influences the postprocessing and does not interact with the neural net during the training process, so we can adjust it afterwards and you can fine-tune it (very carefully, as you have seen it is very sensitive!) to adapt it to your own use cases, without re-training the model.
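Since it only affects postprocessing, the threshold can be swept after training; a quick sketch, assuming the postprocessor exposes bin_thresh as a plain attribute just like box_thresh above:

```python
from doctr.models import db_resnet50

model = db_resnet50(pretrained=True).eval()

# bin_thresh only affects postprocessing, so no retraining is needed per value
for thresh in (0.20, 0.25, 0.30, 0.35):
    model.postprocessor.bin_thresh = thresh
    # run your validation pages through the model here and
    # log detection precision/recall for each threshold
```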
Hi @charlesmindee, I have used crnn_vgg16_bn as a recognition model & db_resnet50 for detection.
Hi @vishveshtrivedi, maybe you can try with our master model to see if the same glitch is happening; for now I cannot really explain it, but I will investigate on my side. Have a nice day!
Hi @charlesmindee, I will try the master model.
Also, an interesting thing I noticed was that I changed the image DPI from 500 to 600 (we are converting PDF to images) and in a few images the recognition improved a lot. Is there some reason for recognition being so sensitive to DPI?
Finally, would it be recommended for the input image to be preprocessed in a certain way (binarized, greyscale, deblurred, etc.)?
Thank you so much!
Hi @vishveshtrivedi,
If you increase the DPI, you will get higher-resolution images from your PDF pages, and this can help the recognition model distinguish letters written in small fonts or slightly blurred lines which can't be resolved at a lower resolution. However, 500 DPI is already a huge resolution (4134 x 5846 pixels for an A4 page). Are you feeding the model a document created with from_pdf, or do you convert your PDF to images before creating your document object? We almost exclusively work with a DPI of 144, and it seems to be enough for A4 PDF pages.
You don't need to binarize/greyscale/... or otherwise preprocess your images before feeding the model; it should work fine! Of course, if you work with particularly noisy or blurred documents, preprocessing the data should only improve your performance.
Have a nice day!
Hi @charlesmindee, I am converting the PDF to images using the pdf2image library, and then, using from_images(), I am feeding the images to the model.
Thanks a lot !
OK, you can also use our PDF converter by instantiating a document with from_pdf(); it will use a 144 DPI rate for the conversion. I am moving this to a GitHub discussion (this will close the issue but open a discussion), as it seems more appropriate for keeping on chatting about technical aspects of the library!
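For reference, a minimal sketch of both conversion routes; the file name is a placeholder, and the DocumentFile import path may differ between doctr versions:

```python
from pdf2image import convert_from_path

# pdf2image route: the dpi argument controls the rasterization resolution
pages = convert_from_path("document.pdf", dpi=144)

# doctr's built-in converter (144 DPI, as mentioned above);
# older releases expose DocumentFile under doctr.documents instead of doctr.io
from doctr.io import DocumentFile
doc = DocumentFile.from_pdf("document.pdf")
```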
Thanks and don't hesitate to come back with new questions !