fix(ocr): Solve the issue of missing some lines and spans due to adhesion during OCR

fix(ocr): decrease detection threshold and increase padding for better text extraction

Decrease the detection box threshold from 0.6 to 0.3 so that more text areas are identified, and increase the padding around each detected area from 25 to 50 pixels. Together, these changes yield more complete text extraction from documents.
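
To illustrate the effect of the two parameters, here is a minimal sketch, not the code from this change. It assumes the threshold is a minimum confidence score for keeping a box and that padding expands each kept box by a fixed margin clamped to the image bounds. The names `filter_and_pad`, `box_thresh`, and `padding` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def filter_and_pad(
    boxes: List[Box],
    scores: List[float],
    image_size: Tuple[int, int],
    box_thresh: float = 0.3,  # assumed meaning: minimum score to keep a box
    padding: int = 50,        # assumed meaning: margin added on every side
) -> List[Box]:
    """Keep boxes scoring at least box_thresh, then expand each by padding."""
    width, height = image_size
    kept = []
    for (x0, y0, x1, y1), score in zip(boxes, scores):
        if score < box_thresh:
            continue  # low-confidence box is dropped
        # Expand the box, clamping to the image so coordinates stay valid.
        kept.append((
            max(0, x0 - padding),
            max(0, y0 - padding),
            min(width, x1 + padding),
            min(height, y1 + padding),
        ))
    return kept


boxes = [(100, 100, 200, 120), (10, 10, 60, 30)]
scores = [0.45, 0.2]
old = filter_and_pad(boxes, scores, (220, 300), box_thresh=0.6, padding=25)
new = filter_and_pad(boxes, scores, (220, 300), box_thresh=0.3, padding=50)
```

In this toy input, the box with score 0.45 is discarded under the old threshold of 0.6 and kept under 0.3, which is the kind of missed region the lower threshold is meant to recover.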

fix(self_modify): merge detection boxes for optimized text region detection

Merge adjacent and overlapping detection boxes to optimize text region detection in the document. Post-processing now consolidates text boxes into larger text lines, taking their vertical and horizontal alignment into account. This reduces fragmentation and improves the readability of detected text blocks.
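
The merging idea can be sketched as follows. This is an illustrative approximation, not the implementation in this change: it assumes two boxes belong to the same line when they overlap vertically by at least a given ratio and their horizontal gap is small. The function names and the `max_gap` and `min_v_overlap` parameters are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def vertical_overlap_ratio(a: Box, b: Box) -> float:
    """Vertical overlap divided by the height of the shorter box."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    min_height = min(a[3] - a[1], b[3] - b[1])
    return overlap / min_height if min_height > 0 else 0.0


def should_merge(a: Box, b: Box, max_gap: int, min_v_overlap: float) -> bool:
    # Horizontal gap is negative when the boxes overlap horizontally.
    h_gap = max(a[0], b[0]) - min(a[2], b[2])
    return h_gap <= max_gap and vertical_overlap_ratio(a, b) >= min_v_overlap


def union(a: Box, b: Box) -> Box:
    """Smallest box that contains both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_boxes(boxes: List[Box], max_gap: int = 10,
                min_v_overlap: float = 0.5) -> List[Box]:
    """Repeatedly merge any qualifying pair until no pair qualifies."""
    result = list(boxes)
    merged_any = True
    while merged_any:
        merged_any = False
        for i in range(len(result)):
            for j in range(i + 1, len(result)):
                if should_merge(result[i], result[j], max_gap, min_v_overlap):
                    result[i] = union(result[i], result[j])
                    del result[j]
                    merged_any = True
                    break
            if merged_any:
                break  # restart the scan, since the merged box has grown
    # Return in reading order: top to bottom, then left to right.
    return sorted(result, key=lambda b: (b[1], b[0]))


fragments = [(0, 0, 50, 20), (55, 2, 120, 22), (40, 1, 60, 19), (0, 40, 80, 60)]
lines = merge_boxes(fragments)
```

The pairwise rescan is quadratic per pass and is chosen here for clarity; a production version would more likely sort boxes and sweep once per line.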

opendatalab / PDF-Extract-Kit

fix(ocr): Solve the issue of missing some lines and spans due to adhesion during OCR #91