pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

The positional information extracted through the Page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES) method has a deviation. #2796

Closed 1339503169 closed 7 months ago

1339503169 commented 8 months ago


Describe the bug (mandatory)

The positional information extracted through the Page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES) method has a deviation.

To Reproduce (mandatory)

Attachments: words_test.pdf, plus two screenshots.

PyMuPDF version: 1.23.5

The code below reproduces the bug:

    import fitz  # PyMuPDF

    document = fitz.open('data/word_test.pdf')
    page = document.load_page(0)

    # Each item is (x0, y0, x1, y1, text, block_no, line_no, word_no)
    words = page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES)

    # Outline every word's bounding box in green
    for word in words:
        rect = fitz.Rect(word[0], word[1], word[2], word[3])
        color = (0, 1, 0)
        page.draw_rect(rect, color)

    document.save('word_test_new.pdf')

Some of the word boxes returned by Page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES) are abnormal: they are much larger than the text they should enclose. Is there an option I am missing that would correct this?
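As a rough way to pin down which boxes are affected, the sketch below flags word tuples whose height is far above the median height on the page. It assumes the tuple layout that get_text('words') returns; the function name find_oversized_words, the factor of 3 and the sample data are hypothetical choices for illustration, not part of PyMuPDF.

```python
from statistics import median


def find_oversized_words(words, factor=3.0):
    """Return word tuples whose box height exceeds factor * median height.

    `words` is expected to look like the output of page.get_text('words'):
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    The threshold `factor` is an arbitrary assumption; tune it per document.
    """
    heights = [w[3] - w[1] for w in words]
    if not heights:
        return []
    typical = median(heights)
    return [w for w, h in zip(words, heights) if h > factor * typical]


# Made-up sample data standing in for real extraction output.
sample = [
    (10, 10, 40, 22, "alpha", 0, 0, 0),
    (45, 10, 80, 22, "beta", 0, 0, 1),
    (10, 30, 60, 42, "gamma", 0, 1, 0),
    (10, 50, 60, 400, "delta", 0, 2, 0),  # suspiciously tall box
]

oversized = find_oversized_words(sample)
print([w[4] for w in oversized])  # ['delta']
```

With a real page, the list from page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES) could be passed in place of sample. This only identifies the outliers; it does not repair their coordinates.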

JorjMcKie commented 8 months ago

This file has an illegal font specification in that it uses "Identity-H" encoding for a non-embedded font (SimSun). This seems to cause confusion w.r.t. character sizes (causing the extremely high bboxes). All this is outside the control of PyMuPDF and has to be looked at by the MuPDF experts. Do you want to submit a bug there? https://bugs.ghostscript.com/enter_bug.cgi
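For readers who want to check their own files for this combination, here is a minimal sketch that scans font entries shaped like the tuples from Page.get_fonts(), i.e. (xref, ext, type, basefont, name, encoding). It treats the extension value "n/a" as a hint that no font file is embedded, which is an assumption rather than a guarantee; the helper name suspect_fonts and the sample entries are hypothetical.

```python
def suspect_fonts(font_entries):
    """Pick fonts with Identity-H/V encoding that appear not to be embedded.

    Each entry is assumed to start with the six fields reported by
    page.get_fonts(): (xref, ext, type, basefont, name, encoding).
    ext == "n/a" is used as a heuristic for "no embedded font file".
    """
    suspects = []
    for entry in font_entries:
        xref, ext, _ftype, basefont, _name, encoding = entry[:6]
        looks_unembedded = ext == "n/a"
        uses_identity = encoding in ("Identity-H", "Identity-V")
        if looks_unembedded and uses_identity:
            suspects.append((xref, basefont, encoding))
    return suspects


# Made-up entries for illustration; real ones come from page.get_fonts().
entries = [
    (12, "n/a", "Type0", "SimSun", "F1", "Identity-H"),
    (15, "ttf", "TrueType", "ABCDEF+Arial", "F2", "WinAnsiEncoding"),
]
print(suspect_fonts(entries))  # [(12, 'SimSun', 'Identity-H')]

# With PyMuPDF installed, real entries could be obtained like this:
#   import fitz
#   page = fitz.open("data/word_test.pdf").load_page(0)
#   print(suspect_fonts(page.get_fonts()))
```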

1339503169 commented 7 months ago

Separately, I tried converting the PDF to an image, and the resulting image does not look consistent with the original PDF. The image I produced from this PDF is attached. Is there a setting I have not set?
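One setting that commonly affects how a rendered page looks is the resolution, since the default rendering is at 72 dpi. The sketch below shows one way to render at a higher resolution; whether this is related to the difference described above is unknown. The helper names zoom_for_dpi and render_page, the file paths and the 150 dpi default are hypothetical.

```python
def zoom_for_dpi(dpi):
    """PDF user space has 72 units per inch, so the zoom factor is dpi / 72."""
    return dpi / 72.0


def render_page(pdf_path, png_path, page_number=0, dpi=150):
    """Render one page to a PNG file at the requested resolution."""
    import fitz  # PyMuPDF; imported here so zoom_for_dpi works without it

    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_number)
        zoom = zoom_for_dpi(dpi)
        # Scale both axes equally so the page keeps its proportions.
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        pix.save(png_path)


print(zoom_for_dpi(144))  # 2.0

# Example call (paths are placeholders):
#   render_page("data/word_test.pdf", "word_test.png", dpi=150)
```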

JorjMcKie commented 7 months ago

I see no difference - where are the deviations?

JorjMcKie commented 7 months ago

Closed because no response was received for an extended period of time.