Height of character boxes is not correct

nate-bush commented 4 years ago

Bug report

Description: The height of character boxes is not correct for some fonts. I removed the other fonts and graphical items from the PDF to isolate the problematic character boxes.

Steps to reproduce:

Run the script provided below on the provided PDF.
Open the output image pdf_with_boxes.png to see the boxes.

import cv2
import numpy as np
from pdf2image import convert_from_path
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTChar

def path_to_images(path, width, height):
    image_dims = (width, height)
    images_from_pdf = [np.array(img) for img in convert_from_path(path, size=image_dims)]
    return [cv2.cvtColor(image, cv2.COLOR_RGB2BGR) for image in images_from_pdf]

def draw_box_on_image(image, bbox, height, color=(255, 0, 0), thickness=1):
    bbox = bbox[0], height - bbox[3], bbox[2],  height - bbox[1]
    pt1 = (int(bbox[0]), int(bbox[1]))
    pt2 = (int(bbox[2]), int(bbox[3]))
    cv2.rectangle(image, pt1, pt2, color=color, thickness=thickness)

def parse_pages(pdf_path):
    # Keep the file open only while pages are being read, then close it.
    with open(pdf_path, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)
        parser.set_document(doc)

        rsrcmgr = PDFResourceManager()
        laparams = LAParams(char_margin=3.5, all_texts=True)
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Yield one layout tree (LTPage) per page.
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
            layout = device.get_result()
            yield layout

if __name__ == '__main__':

    pdf_path = 'chinese_chars_with_incorrect_char_boxes.pdf'
    output_image_path = "pdf_with_boxes.png"
    first_page = next(parse_pages(pdf_path))
    width, height = first_page.bbox[2], first_page.bbox[3]
    image = path_to_images(pdf_path, width, height)[0]

    for page in parse_pages(pdf_path):
        for tbox in page:
            if not isinstance(tbox, LTTextBox):
                continue
            for line in tbox:
                for char in line:
                    if not isinstance(char, LTChar):
                        continue
                    box = (char.x0, char.y0, char.x1, char.y1)
                    draw_box_on_image(image, box, height)

    cv2.imwrite(output_image_path, image)

chinese_chars_with_incorrect_char_boxes.pdf
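
For anyone who wants to quantify the problem rather than eyeball the image, here is a minimal sketch that flags boxes whose height is small relative to the median box height. `flag_short_boxes` and the `ratio` threshold are hypothetical, chosen only for illustration, and are not part of pdfminer.six. The input is plain `(x0, y0, x1, y1)` tuples like the ones the script above builds from each `LTChar`.

```python
from statistics import median


def flag_short_boxes(boxes, ratio=0.5):
    """Return the boxes whose height is below ratio * median height.

    boxes: iterable of (x0, y0, x1, y1) tuples in PDF coordinates.
    ratio: hypothetical threshold; tune it for the document at hand.
    """
    boxes = list(boxes)
    if not boxes:
        return []
    heights = [y1 - y0 for _, y0, _, y1 in boxes]
    cutoff = ratio * median(heights)
    # Keep only boxes that are noticeably shorter than the typical box.
    return [box for box, h in zip(boxes, heights) if h < cutoff]
```

This is only a diagnostic. If most boxes on a page have the wrong height, the median is wrong too, and a fixed cutoff based on the expected font size would probably work better.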

pietermarsman commented 4 years ago

I can replicate this issue with the newest version of pdfminer.six. I tried cleaning the PDF with mutools and running the code again, but it made no difference.

rinczefi commented 4 years ago

Hi, I'm using pdfminer.six-20200726.

I have another question regarding cases where the font type is "unknown". As I understand it, if an LTChar has the font type "unknown", it will have a negligible height but a proper width. Is there any way to recover the character heights as well? Why isn't this implemented already?

Thank you!
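
One way to check whether the near-zero heights really line up with a particular font is to group character sizes by font name. The sketch below assumes each character has been reduced to a `(fontname, width, height)` tuple, for example from the `fontname`, `width`, and `height` attributes of `LTChar`. `summarize_by_font` is a hypothetical helper, not a pdfminer.six function, and it only reports sizes; it does not recover the missing heights.

```python
from collections import defaultdict
from statistics import median


def summarize_by_font(chars):
    """Map each font name to (count, median width, median height).

    chars: iterable of (fontname, width, height) tuples.
    """
    grouped = defaultdict(list)
    for fontname, width, height in chars:
        grouped[fontname].append((width, height))

    summary = {}
    for fontname, sizes in grouped.items():
        widths = [w for w, _ in sizes]
        heights = [h for _, h in sizes]
        summary[fontname] = (len(sizes), median(widths), median(heights))
    return summary
```

A font whose median height is far smaller than its median width would be a likely candidate for the problem described here.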

jstockwin commented 4 years ago

@rinczefi Are you able to share your PDF? Sounds like we'd need to work out why the font is unknown...

rinczefi commented 4 years ago

unknown.pdf This is the PDF I'm stuck with right now. Thanks in advance.

rinczefi commented 4 years ago

@jstockwin Is there any progress on this issue yet?

jstockwin commented 4 years ago

@rinczefi Apologies for not responding sooner. Unfortunately I am currently quite busy and so have not had much spare time for open source stuff. It's on my list of things to get around to, but no guarantees, I'm afraid.

gauranglendbuzz commented 3 years ago

@rinczefi were you able to solve this issue?

rinczefi commented 3 years ago

> @rinczefi were you able to solve this issue?

No, I was not.

gauranglendbuzz commented 3 years ago

@jstockwin @pietermarsman Are there any updates on this issue, or at least an idea of what the root cause is? It would be wonderful if you could shed some light on it. Thanks in advance.

pdfminer / pdfminer.six

Height of character boxes is not correct #443