Bug in multi-column page with dense text

vikasr111 commented 2 days ago

Bug description

I am trying to use DocTR on a document that has dense text arranged in two columns. I noticed that the text detection is incorrect: it identifies multiple overlapping text blocks, and as a result the text output is also incorrect.

Here's the original document:

Here's the OCR plot:

Here's the segmentation result:

How to address it?

Code snippet to reproduce the bug

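from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# PDF
pdf_doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
# Assumed: default pretrained predictor (the report does not say which models were used)
result = ocr_predictor(pretrained=True)(pdf_doc)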

Error traceback

No error, but the output is incorrect.

Environment

python 3.10

Deep Learning backend

Torch

felixdittrich92 commented 2 days ago

Hi @vikasr111 :wave:,

Thanks for reporting :+1:

It's already planned to retrain all detection models with our new augmentation pipeline and an extended dataset for pretraining to make them more robust.

Could you please give "db_mobilenet_v3_large" a try as the detection arch? (This model is already pretrained with our new augmentation pipeline.)

Additionally, you can experiment with the bin_thresh and box_thresh values (lower value -> more detections / less accurate | higher value -> possibly fewer detections / more accurate): https://mindee.github.io/doctr/using_doctr/using_models.html#advanced-options

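from doctr.models import ocr_predictor

# `doc`: pages loaded via DocumentFile.from_pdf(...), as in the report above
predictor = ocr_predictor(
    det_arch="db_mobilenet_v3_large",
    reco_arch="parseq",
    pretrained=True,
    preserve_aspect_ratio=False,
    symmetric_pad=False,
)

# Lower thresholds: more detections, but less accurate ones
predictor.det_predictor.model.postprocessor.bin_thresh = 0.35
predictor.det_predictor.model.postprocessor.box_thresh = 0.3

result = predictor(doc)
result.show()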
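
To give a feel for how the two thresholds interact, here is a toy one-dimensional sketch. It is not docTR's post-processor: `detect_runs` and the sample probabilities are made up for illustration, and it assumes a DB-style scheme where bin_thresh binarizes the probability map and box_thresh drops candidates whose mean probability is too low.

```python
def detect_runs(prob_row, bin_thresh, box_thresh):
    """Toy 1-D analogue of a DB-style post-processor (hypothetical helper, not docTR code)."""
    boxes = []
    start = None
    # A trailing -inf sentinel closes a run that reaches the end of the row.
    for i, p in enumerate(list(prob_row) + [float("-inf")]):
        if p >= bin_thresh and start is None:
            start = i  # a run of "text" pixels begins
        elif p < bin_thresh and start is not None:
            run = prob_row[start:i]
            score = sum(run) / len(run)  # mean probability inside the candidate
            if score >= box_thresh:
                boxes.append((start, i, score))
            start = None
    return boxes


row = [0.1, 0.9, 0.8, 0.2, 0.4, 0.45, 0.1]  # made-up probabilities
for bin_t, box_t in [(0.35, 0.3), (0.35, 0.5), (0.05, 0.3)]:
    spans = [(s, e) for s, e, _ in detect_runs(row, bin_t, box_t)]
    print(f"bin_thresh={bin_t}, box_thresh={box_t}: {spans}")
```

In this toy, raising box_thresh drops the weaker candidate, while a very low bin_thresh merges everything into one wide span, which is roughly the kind of over-merging that dense text can trigger.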


felixdittrich92 commented 2 days ago

CC @odulcy-mindee A good sign that the new augmentation pipeline improves our models ^^ Nevertheless, I think we need to expand the dataset a bit.

vikasr111 commented 2 days ago

Thanks for the reply. db_mobilenet_v3_large does work better. Any idea when the new models will be available?

I have another follow-up question. Is there an example of how I can plot the OCR line output on a canvas according to its geometry, and eventually generate text output in which the text is arranged with spaces and newlines to maintain the layout of the original document?

Here's a sample output:

========== Page 1 of 1 ==========

                                                                                                      Page 1 of 1
                                                                                               PURCHASE ORDER
                                              PURCHASE ORDER
             SENSIENT                                                                               3157276
                                         Ship To:
             Vendor:                                                  Bill To:                     Supplier:
      Company FJvbboinbio         SINEDOIBENT COLORS LLC         Sbsbwwb Colors LLC           JBSUVWVE
      US LLC                      1659 SAUGET BUSINESS         2515 North Jefferson Avenue   1421 WILLIS
      DEPT 771807                  BLVD STE A                  St. Louis MO 63106           SYRACUSE NY 13204
      P O BOX 77000                SAUGET TI 62206             314-658-7318
      DETROIT                                                  314-286-7172
      IW                                                       Attn: Accounts Payable
     48277-1807                                                APSTLColor @iuuivv.com
                                                                                                           Tax Exempt
      Vendor No# Order Date         Ship Via         Freight Terms           FOB         Payment Terms
                                                                                                               No#
                                                                                                            82-3618676
       10153300 2024-11-05                                                                    020Net
      Sensient/Supplier                                                                     PR Extended
          Item No#            Product Description        Tax Order    Qty  UM Unit Price    UM Amount          Date Due
                                                                  19958.0000 KG                       90,639.71 2025-01-24
      717701             SODIUM NITRITE FCC SPEC GRAN     N                           2.0600 LB
        SN FREE-FLOW       SODIUM NITRITE FCC SPEC GRAN
        FOOD GRADE        2000LB SACK
                                                          Total Amount                                   90,639.71
       *PLEASE CONFIRM BY REPLYING WITH CORRECT PRICING & DELIVERY TO : Colors.PurchasingSTL@ Sensient.com *
     IMPORTANT THIS PURCHASE ORDER NO. MUST APPEAR ON ALL INVOICES BILL OF LADING. PACKING SLIP. AND PACKAGES ALL INVOICES MUST
     DUPLICATE OT ACCOUNTS PAYABLE AT THE ADDRESS LISTED.                            BUYER: JEFFY SULLIVAN
     NO DELIVERIES ACCEPTED AFTER 3:00 PM (MON-FRI)
                                                                                     PHONE:
                                                                                     FAX:
                                                                                     E-MAIL jeftsulivan@example.com
     SEE REVERSE SIDE FOR TERMS & CONDITIONS
     All dates in YYYY-MM-DD format.

felixdittrich92 commented 2 days ago

CC @odulcy-mindee @vikasr111 (@odulcy-mindee, correct me if that's not realistic): I think we can start in January to retrain and test with our already updated augmentation pipeline. Retraining with an extended dataset will take a bit more time.

About the second part, have you already tried:

import matplotlib.pyplot as plt

result = predictor(doc)
synthetic_pages = result.synthesize()
plt.imshow(synthetic_pages[0]); plt.axis('off'); plt.show()

You can additionally try passing resolve_blocks=True to the ocr_predictor. It's currently disabled by default because there are practically endless possible document layouts, and the algorithm fails on them too often. :)
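
If plain text with preserved layout is the goal, a rough sketch of one possible approach is to place each word on a character grid using its relative geometry. The helpers `words_from_export` and `render_layout` are hypothetical names, not part of docTR, and the sketch assumes a page dict shaped like an entry of result.export()["pages"], with straight boxes given as ((xmin, ymin), (xmax, ymax)) in relative coordinates.

```python
def words_from_export(page):
    """Flatten one exported page into (text, xmin, ymin) tuples.

    Assumes straight boxes: geometry == ((xmin, ymin), (xmax, ymax)), relative coords.
    """
    words = []
    for block in page["blocks"]:
        for line in block["lines"]:
            for word in line["words"]:
                (xmin, ymin), _ = word["geometry"]
                words.append((word["value"], xmin, ymin))
    return words


def render_layout(words, cols=100, rows=50):
    """Arrange words on a cols x rows character grid by their relative position."""
    buckets = {}
    for text, xmin, ymin in words:
        row = min(int(ymin * rows), rows - 1)
        buckets.setdefault(row, []).append((xmin, text))

    lines = []
    for row in range(max(buckets, default=-1) + 1):
        line = ""
        for xmin, text in sorted(buckets.get(row, [])):  # left to right
            col = min(int(xmin * cols), cols - 1)
            if line and len(line) >= col:
                line += " "  # target column already taken: just add a separator
            else:
                line = line.ljust(col)  # pad with spaces up to the target column
            line += text
        lines.append(line)
    return "\n".join(lines)


# Hand-made page for illustration; with docTR this would be something like
# page = result.export()["pages"][0]
page = {"blocks": [{"lines": [
    {"words": [
        {"value": "Vendor:", "geometry": ((0.12, 0.25), (0.30, 0.29))},
        {"value": "Bill", "geometry": ((0.52, 0.25), (0.57, 0.29))},
        {"value": "To:", "geometry": ((0.58, 0.26), (0.62, 0.30))},
    ]},
    {"words": [{"value": "DETROIT", "geometry": ((0.12, 0.45), (0.30, 0.49))}]},
]}]}
print(render_layout(words_from_export(page), cols=20, rows=5))
```

The grid size is a trade-off: too few rows merges neighbouring lines, too many splits words of the same line across rows, so cols and rows would need tuning per document.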
