This file has an illegal font specification: it uses "Identity-H" encoding for a non-embedded font (SimSun). This seems to cause confusion with respect to character sizes, which results in the extremely high bboxes. All of this is outside the control of PyMuPDF and has to be looked at by the MuPDF experts. Do you want to submit a bug there? https://bugs.ghostscript.com/enter_bug.cgi
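To check whether other files have the same kind of font specification, something like the following may help. It is only a sketch: it assumes the tuples returned by `page.get_fonts()` start with `(xref, ext, type, basefont, name, encoding)` and that an `ext` of `"n/a"` marks a font whose file cannot be extracted, which is used here as a stand-in for "not embedded". The helper name `find_suspect_fonts` and the sample entries are made up for illustration.

```python
def find_suspect_fonts(font_entries):
    """Return (basefont, encoding) pairs for fonts that use an Identity
    encoding but appear to have no embedded font file.

    Assumption: each entry has the layout of a page.get_fonts() item,
    (xref, ext, type, basefont, name, encoding, ...), and ext == "n/a"
    indicates that no font file is available in the PDF.
    """
    suspects = []
    for entry in font_entries:
        xref, ext, ftype, basefont, name, encoding = entry[:6]
        if ext == "n/a" and encoding.startswith("Identity"):
            suspects.append((basefont, encoding))
    return suspects


# Hypothetical entries, shaped like items from page.get_fonts():
sample_fonts = [
    (12, "n/a", "Type0", "SimSun", "F1", "Identity-H"),
    (15, "ttf", "TrueType", "ABCDEF+Arial", "F2", "WinAnsiEncoding"),
]
print(find_suspect_fonts(sample_fonts))

# With a real document this would be called as:
#   find_suspect_fonts(page.get_fonts())
```

A non-empty result only hints that the file may run into the same problem; it does not prove that the extracted bboxes are wrong.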
Separately, when I convert the PDF to an image, the result does not look consistent with the original PDF. This is the image I rendered from this PDF. Is there some setting I have not set?
I see no difference - where are the deviations?
Closed because there has been no response for an extended period of time.
Please provide all mandatory information!
Describe the bug (mandatory)
The positional information (word bounding boxes) extracted by the Page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES) method deviates from the actual positions of the words on the page.
To Reproduce (mandatory)
words_test.pdf
![image](https://github.com/pymupdf/PyMuPDF/assets/22074904/2cd99b6e-b8b2-4636-bcaa-42a8f189686b)
PyMuPDF version is 1.23.5.
The code below reproduces the bug:
    import fitz  # PyMuPDF

    document = fitz.open('data/word_test.pdf')
    page = document.load_page(0)
    # Each word tuple starts with its bounding box: x0, y0, x1, y1
    words = page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES)
    for word in words:
        rect = fitz.Rect(word[0], word[1], word[2], word[3])
        color = (0, 1, 0)  # green outline
        page.draw_rect(rect, color)
    document.save('word_test_new.pdf')
The text boxes extracted through the Page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES) contain some abnormal blocks that seem much larger than I anticipated. Is there room for optimization that I might be missing?
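As a possible stopgap on the caller's side, the abnormal boxes could be separated from the rest with a simple height heuristic. This is a sketch only, not a fix for the underlying font problem: it assumes each word tuple starts with `(x0, y0, x1, y1)` as in the code above, and both the helper name `split_abnormal_words` and the `factor` threshold are made up for illustration and would need tuning per document.

```python
from statistics import median


def split_abnormal_words(words, factor=3.0):
    """Split word tuples into (normal, abnormal) by bounding-box height.

    A word counts as abnormal when its box height exceeds `factor` times
    the median height of all words on the page. Assumption: every tuple
    starts with x0, y0, x1, y1.
    """
    if not words:
        return [], []
    heights = [w[3] - w[1] for w in words]
    limit = factor * median(heights)
    normal = [w for w, h in zip(words, heights) if h <= limit]
    abnormal = [w for w, h in zip(words, heights) if h > limit]
    return normal, abnormal


# Hypothetical word tuples: (x0, y0, x1, y1, text, block_no, line_no, word_no)
sample_words = [
    (0, 0, 10, 10, "a", 0, 0, 0),
    (12, 0, 22, 10, "b", 0, 0, 1),
    (24, 0, 34, 11, "c", 0, 0, 2),
    (0, 20, 10, 500, "d", 1, 0, 0),  # implausibly tall box
]
normal, abnormal = split_abnormal_words(sample_words)
print(len(normal), len(abnormal))
```

The median is used instead of the mean so that a few very tall boxes do not raise the threshold themselves. Pages that legitimately mix very different font sizes would probably need a larger `factor` or a per-block comparison.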