opendatalab / PDF-Extract-Kit

A Comprehensive Toolkit for High-Quality PDF Content Extraction
GNU Affero General Public License v3.0

fix(ocr): Solve the issue of missing some lines and spans due to adhesion during OCR #91

Closed myhloli closed 1 month ago

myhloli commented 1 month ago

Decrease the detection box threshold from 0.6 to 0.3 so that more text areas are identified, and increase the padding around each detected area from 25 to 50 pixels. This leads to more comprehensive text extraction from documents.
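As a rough illustration of why these two values matter, the sketch below filters candidate boxes by a score threshold and expands each kept box by a fixed padding. It is not the code from this PR: `filter_boxes`, `pad_box`, the `(x0, y0, x1, y1)` box layout, and the sample scores are all hypothetical, and it assumes the padding is applied as a margin clamped to the page bounds.

```python
from typing import List, Tuple

# Hypothetical box layout: (x0, y0, x1, y1) in pixels.
Box = Tuple[int, int, int, int]


def filter_boxes(candidates: List[Tuple[Box, float]], box_thresh: float) -> List[Box]:
    """Keep only the boxes whose detection score reaches the threshold."""
    return [box for box, score in candidates if score >= box_thresh]


def pad_box(box: Box, pad: int, width: int, height: int) -> Box:
    """Expand a box by `pad` pixels on every side, clamped to the page size."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - pad), max(0, y0 - pad),
            min(width, x1 + pad), min(height, y1 + pad))


# Made-up candidates: one confident, one borderline, one weak.
candidates = [
    ((100, 100, 300, 130), 0.82),
    ((100, 140, 280, 170), 0.45),
    ((100, 180, 120, 200), 0.12),
]

kept_old = filter_boxes(candidates, 0.6)  # only the confident box survives
kept_new = filter_boxes(candidates, 0.3)  # the borderline box is kept as well

# Padding of 50 px on a 320x400 page; the left and right edges are clamped.
padded = pad_box((10, 100, 300, 130), 50, 320, 400)
```

With the lower threshold, the borderline box (score 0.45) is no longer discarded; the trade-off is that more low-confidence regions reach the later stages.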

Merge adjacent and overlapping detection boxes to optimize text region detection in the document. Post-processing of text boxes is enhanced by consolidating them into larger text lines based on their vertical and horizontal alignment. This reduces fragmentation and improves the readability of detected text blocks.
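The merging step described above can be sketched roughly as follows. This is a minimal illustration, not the implementation in this PR: the function names, the 0.5 vertical-overlap ratio, and the 10-pixel `max_gap` are assumptions chosen for the example. Two boxes are treated as belonging to the same line when they overlap vertically by at least half the height of the shorter one, and they are merged when they also overlap horizontally or sit within `max_gap` pixels of each other.

```python
from typing import List, Tuple

# Hypothetical box layout: (x0, y0, x1, y1) in pixels.
Box = Tuple[float, float, float, float]


def same_line(a: Box, b: Box, min_ratio: float = 0.5) -> bool:
    """True if the vertical overlap covers at least `min_ratio` of the shorter box."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return shorter > 0 and overlap / shorter >= min_ratio


def horizontal_gap(a: Box, b: Box) -> float:
    """Horizontal distance between two boxes; negative when they overlap."""
    return max(a[0], b[0]) - min(a[2], b[2])


def union(a: Box, b: Box) -> Box:
    """Smallest box that contains both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_into_lines(boxes: List[Box], max_gap: float = 10.0) -> List[Box]:
    """Repeatedly merge same-line boxes that overlap or are within `max_gap`."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                a, b = merged[i], merged[j]
                if same_line(a, b) and horizontal_gap(a, b) <= max_gap:
                    merged[i] = union(a, b)
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan, since the merged box may now reach others
    # Return boxes in reading order: top to bottom, then left to right.
    return sorted(merged, key=lambda box: (box[1], box[0]))


boxes = [
    (10, 10, 50, 30),    # overlaps the next box
    (45, 12, 90, 31),
    (95, 11, 140, 29),   # 5 px to the right of the previous box
    (10, 50, 60, 70),    # second line, left fragment
    (200, 50, 260, 70),  # second line, too far away to merge
]
lines = merge_into_lines(boxes)
```

Here the three fragments of the first line collapse into a single box, while the two distant fragments on the second line stay separate. The quadratic rescan keeps the sketch short; a production version would more likely sort the boxes first and merge in a single pass.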