mismatch in sequence of words in result.export()

PoornaSaiNagendra commented 3 years ago

🐛 Bug

The order of the words returned by result.export() does not match the order of the words in the input image: the columns are swapped.

To Reproduce

Steps to reproduce the behavior:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
doc = DocumentFile.from_images("/path/to/the/image")
result = model(doc)
result.show(doc)

# Collect the words of the first line of the first block on the first page
json_output = result.export()
words_dic = json_output['pages'][0]['blocks'][0]['lines'][0]['words']
words_list = [word['value'] for word in words_dic]
total_text = ' '.join(words_list)
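The snippet above reads only the first line of the first block. As a minimal sketch, the words of every block and line on a page could be collected as follows. It assumes the export has the same nesting as above (pages, blocks, lines, words). The helper name `collect_words` and the hand-made `sample_export` dictionary are hypothetical stand-ins, not part of docTR.

```python
from typing import Any, Dict, List


def collect_words(export: Dict[str, Any], page_idx: int = 0) -> List[str]:
    """Return the 'value' of every word on one page, in export order."""
    page = export["pages"][page_idx]
    return [
        word["value"]
        for block in page["blocks"]
        for line in block["lines"]
        for word in line["words"]
    ]


# Hand-made stand-in for result.export(): one page, two blocks.
sample_export = {
    "pages": [
        {
            "blocks": [
                {"lines": [{"words": [{"value": "HINDI"}, {"value": "100"}]}]},
                {"lines": [{"words": [{"value": "ENGLISH"}, {"value": "100"}]}]},
            ]
        }
    ]
}

total_text = " ".join(collect_words(sample_export))
print(total_text)
```

This only changes how the words are gathered. It keeps whatever order the export already has, so it does not by itself fix swapped columns.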

I can't provide the complete image for privacy reasons, but I am providing the part of the image that matters for my use case.

Expected behavior

The output I am getting is:

HINDI (SPECIALI EVEN EIGHT 100 078 DISTIN 33 078 ENGLISH GENERAL) HIVE TWO 33 100 052 052 SANSKRIT GENERAL AIVEN TWO 100 33 072 072 MATHEMATICS 100 33 SUK ONE 061 061 SCIENCE 100 25 08 20 040 060 Sus ZERO SOCIAL SCIENCE 33 100 062 TWO 062

And the expected output is:

HINDI (SPECIALI) 100 33 078 078 SEVEN EIGHT DISTN ENGLISH (GENERAL) 100 33 052 052 FIVE TWO SANSKRIT GENERAL TWO 100 33 072 072 SEVEN TWO MATHEMATICS 100 33 061 061 SIX ONE SCIENCE 100 25 08 040 20 060 SIX ZERO SOCIAL SCIENCE 100 33 062 062 SIX TWO

Environment

I am using Google Colab free version

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/mindee/doctr/main/scripts/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Collecting environment information...

DocTR version: 0.4.0
TensorFlow version: 2.6.0
PyTorch version: 1.9.0+cu111 (torchvision 0.10.0+cu111)
OpenCV version: 4.5.3
OS: Ubuntu 18.04.5 LTS
Python version: 3.7
Is CUDA available (TensorFlow): No
Is CUDA available (PyTorch): No
CUDA runtime version: 11.1.105
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5

Additional context

![cropped_dect](https://user-images.githubusercontent.com/42320447/137127595-1bbb42b6-3035-4f32-aeb1-6c0a8133baa8.jpeg)

The above image is the cropped output from result.show(doc).

Thanks for any help you can provide in resolving this issue.

charlesmindee commented 3 years ago

Hi @PoornaSaiNagendra,

Thank you for your interest in doctr! If I understand correctly, your problem is the ordering of boxes in the output (boxes are not mapped to the correct lines/blocks and/or blocks are not ordered). We use box coordinates to reconstruct lines, and hierarchical clustering of lines to find blocks, but this is not a very robust approach, especially when the page has many columns.

Since I don't have access to the document, could you help me by plotting or listing the content of the different lines and/or blocks?

Thanks a lot :pray:

PoornaSaiNagendra commented 3 years ago

Hi @charlesmindee,

Thanks for replying. Here is a copy of the kind of document I used. I hope it helps you resolve the issue.

Chhattisgarh_BOSE

Image source: Google Images
Note: No copyright infringement is intended.

The above image can be found at the link below: (https://images.app.goo.gl/FQuYLc2GhUkNHz83A)

Thanks a lot

charlesmindee commented 3 years ago

Hi @PoornaSaiNagendra,

The option to resolve page lines and blocks is not activated by default. You need to activate it in the DocumentBuilder (models/utils/builder.py) to sort your document by blocks and lines; otherwise you get a single block containing a single line that encapsulates all the words of the page.

I activated the option, but it does not work well with your document: as I mentioned above, our line/block resolution algorithm is not very robust. You can either modify the geometrical parameters of the line resolution function in the builder, or use the box coordinates in the output directly to reorder the boxes as you wish. I am sorry for this limitation; we are going to work on table comprehension/reconstruction, as suggested in #524, in the coming weeks, and it may help you with this! :smile:
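As a rough sketch of the second suggestion (reordering by box coordinates), the words can be grouped into rows by vertical position and then sorted left to right within each row. This is not docTR's own algorithm. It assumes each exported word carries a `geometry` entry of the form ((xmin, ymin), (xmax, ymax)) in relative page coordinates. The helper name `group_into_rows`, the tolerance `y_tol`, and the sample boxes are hypothetical.

```python
from typing import Any, Dict, List

Word = Dict[str, Any]


def y_center(word: Word) -> float:
    """Vertical centre of a word box given as ((xmin, ymin), (xmax, ymax))."""
    (_, y_min), (_, y_max) = word["geometry"]
    return (y_min + y_max) / 2


def group_into_rows(words: List[Word], y_tol: float = 0.01) -> List[List[Word]]:
    """Group words into rows by vertical centre, then sort each row by xmin.

    A word joins the current row when its centre is within y_tol of the
    previously added word; otherwise it starts a new row.
    """
    rows: List[List[Word]] = []
    for word in sorted(words, key=y_center):
        if rows and abs(y_center(word) - y_center(rows[-1][-1])) <= y_tol:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [sorted(row, key=lambda w: w["geometry"][0][0]) for row in rows]


# Hypothetical word boxes for two table rows, listed in scrambled order.
words = [
    {"value": "100", "geometry": ((0.40, 0.100), (0.45, 0.120))},
    {"value": "HINDI", "geometry": ((0.05, 0.101), (0.20, 0.121))},
    {"value": "052", "geometry": ((0.60, 0.150), (0.65, 0.170))},
    {"value": "33", "geometry": ((0.50, 0.099), (0.53, 0.119))},
    {"value": "ENGLISH", "geometry": ((0.05, 0.151), (0.22, 0.171))},
    {"value": "078", "geometry": ((0.60, 0.100), (0.65, 0.120))},
    {"value": "100", "geometry": ((0.40, 0.150), (0.45, 0.170))},
    {"value": "33", "geometry": ((0.50, 0.149), (0.53, 0.169))},
]

lines = [" ".join(w["value"] for w in row) for row in group_into_rows(words)]
print(lines)  # ['HINDI 100 33 078', 'ENGLISH 100 33 052']
```

The value of `y_tol` would need tuning per document: too small splits a slightly skewed row, too large merges neighbouring rows.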

Best

PoornaSaiNagendra commented 3 years ago

Thanks for the suggestion. Looking forward to table comprehension/reconstruction.

Regards

PoornaSaiNagendra commented 3 years ago

Hi @charlesmindee

Actually, I am looking to extract information from the table. I initially proceeded with regex, but because the words are currently misaligned, a regex written against the present output might no longer be suitable once the issue is resolved.
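For illustration only, a regex over one correctly ordered row might look like the sketch below. It assumes each row is a subject name, then a run of numbers, then optional trailing words, as in the expected output above. The field names `subject`, `numbers` and `words` are hypothetical, and what each number means is not stated in this thread.

```python
import re
from typing import Dict, List, Optional, Union

# Subject text, then one or more whitespace-separated numbers, then optional words.
ROW_PATTERN = re.compile(
    r"(?P<subject>\D+?)\s+(?P<numbers>\d+(?:\s+\d+)*)\s*(?P<words>\D*)"
)


def parse_row(row: str) -> Optional[Dict[str, Union[str, List[str]]]]:
    """Split one table row into subject, numeric fields and trailing words."""
    match = ROW_PATTERN.fullmatch(row.strip())
    if match is None:
        return None
    return {
        "subject": match.group("subject").strip(),
        "numbers": match.group("numbers").split(),
        "words": match.group("words").split(),
    }


parsed = parse_row("HINDI (SPECIALI) 100 33 078 078 SEVEN EIGHT DISTN")
print(parsed)
```

A pattern like this depends on the numbers forming one contiguous run, which is exactly what breaks when the columns are swapped.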

Could you please let me know whether there is any chance of including key information extraction (KIE) models in the pipeline at present, or suggest an alternative approach for building our own custom KIE that can be added as a postprocessing step to docTR?

Thanks and Regards

felixdittrich92 commented 3 years ago

@PoornaSaiNagendra do you mean something like this LayoutLM-Example? If so, take a look at Tut for the moment; in that case you can replace tesseract with doctr after detection :)

PoornaSaiNagendra commented 3 years ago

@felixdittrich92 Thanks for pointing me to the materials I needed. Since my data is inside a table, I was looking for models similar to this that can help me integrate doctr with downstream tasks like key information extraction 😄 For now I am using spaCy to add my own custom entities.

charlesmindee commented 3 years ago

I am moving this to a discussion so that we can continue the conversation there and close the bug issue.
