The positional information extracted through the Page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES) method has a deviation.

1339503169 commented 8 months ago

Please provide all mandatory information!

Describe the bug (mandatory)

To Reproduce (mandatory)

words_test.pdf

pymupdf version is 1.23.5

The code below reproduces the bug:

import fitz  # PyMuPDF

document = fitz.open('data/word_test.pdf')
page = document.load_page(0)
# Each word entry starts with its bounding box: (x0, y0, x1, y1, ...)
words = page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES)
for word in words:
    rect = fitz.Rect(word[0], word[1], word[2], word[3])
    color = (0, 1, 0)  # green outline
    page.draw_rect(rect, color)
document.save('word_test_new.pdf')

The word boxes extracted through Page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES) include some abnormal ones that are much larger than I expected. Is there room for optimization that I might be missing?
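As a possible stopgap on the caller's side, the oversized boxes described above could be separated from the rest by comparing each box height with the median height on the page. This is only a sketch: the function name split_outlier_boxes and the threshold factor of 3.0 are hypothetical choices made for illustration, not part of PyMuPDF, and the sample tuples are made-up data shaped like 'words' output.

```python
from statistics import median


def split_outlier_boxes(words, factor=3.0):
    """Split word tuples into (normal, outliers) by bounding-box height.

    Each word is assumed to start with (x0, y0, x1, y1), as in the
    tuples returned by page.get_text('words').
    The factor is an arbitrary, hypothetical threshold.
    """
    if not words:
        return [], []
    heights = [w[3] - w[1] for w in words]
    limit = factor * median(heights)
    normal = [w for w, h in zip(words, heights) if h <= limit]
    outliers = [w for w, h in zip(words, heights) if h > limit]
    return normal, outliers


# Made-up sample data in the shape of 'words' output:
# (x0, y0, x1, y1, text, block_no, line_no, word_no)
sample = [
    (10, 10, 40, 22, "alpha", 0, 0, 0),
    (45, 10, 80, 22, "beta", 0, 0, 1),
    (10, 30, 50, 42, "gamma", 0, 1, 0),
    (55, 0, 90, 120, "delta", 0, 1, 1),  # abnormally tall box
]
normal, outliers = split_outlier_boxes(sample)
```

In the reproduction script, the list returned by page.get_text('words', ...) could be passed to this function in place of the sample data, and only the entries in normal drawn. This hides the symptom and does not correct the reported coordinates.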

JorjMcKie commented 8 months ago

This file has an illegal font specification in that it uses "Identity-H" encoding for a non-embedded font (SimSun). This seems to cause confusion w.r.t. character sizes (causing the extremely high bboxes). All this is outside the control of PyMuPDF and has to be looked at by the MuPDF experts. Do you want to submit a bug there? https://bugs.ghostscript.com/enter_bug.cgi
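A rough way to look for this font situation in other files is to scan the font list of a page. The sketch below assumes the entries have the layout returned by page.get_fonts(), namely (xref, ext, type, basefont, name, encoding), and it assumes that an ext value of "n/a" means the font is not embedded; that second point should be checked against the PyMuPDF documentation. The function name suspect_fonts and the sample entries are hypothetical.

```python
def suspect_fonts(font_entries):
    """Return (xref, basefont) for fonts using Identity-H without a font file.

    Entries are assumed to look like those from page.get_fonts():
    (xref, ext, type, basefont, name, encoding, ...).
    Assumption: ext == "n/a" indicates a non-embedded font.
    """
    flagged = []
    for entry in font_entries:
        xref, ext, _ftype, basefont, _name, encoding = entry[:6]
        if encoding == "Identity-H" and ext == "n/a":
            flagged.append((xref, basefont))
    return flagged


# Made-up entries for illustration only
sample_fonts = [
    (12, "n/a", "Type0", "SimSun", "F1", "Identity-H"),
    (15, "ttf", "TrueType", "ABCDEF+Arial", "F2", "WinAnsiEncoding"),
]
flagged = suspect_fonts(sample_fonts)
```

With a real document, page.get_fonts() would supply the entries. A non-empty result only indicates that the page may be affected in the same way.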

1339503169 commented 7 months ago

Separately, I tried converting the PDF to an image, but the resulting image does not look consistent with the original PDF. This is the image I converted from this PDF. Is there some setting I have not set?

JorjMcKie commented 7 months ago

I see no difference - where are the deviations?

JorjMcKie commented 7 months ago

Closed because no response was received for an extended period of time.

pymupdf / PyMuPDF

The positional information extracted through the Page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES) method has a deviation. #2796
