pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0
5.75k stars 533 forks

Getting issue in get_text() #4037

Closed ashifaliclientpoint closed 1 week ago

ashifaliclientpoint commented 2 weeks ago

Description of the bug

I am using this library to fetch the index (order) of some tags, and in general everything works fine. In one specific file, however, I am getting an issue. In my file the tags appear in the order c:a:r, i:a:o, i:a:o, but when I fetch the index of these tags from the file, they are returned in the order i:a:o, i:a:o, c:a:r.

Here is my Python script:

import fitz

file_path = "checkbox-issue.pdf"
doc = fitz.open(file_path)

fitz.TOOLS.set_small_glyph_heights(True)
for page in doc:
    text = page.get_text()
    print(text)

Please suggest a solution, or let me know if I am doing something wrong. Thanks.

How to reproduce the bug

Use the script below:

import fitz

file_path = "checkbox-issue.pdf"
doc = fitz.open(file_path)

fitz.TOOLS.set_small_glyph_heights(True)
for page in doc:
    text = page.get_text()
    print(text)

PyMuPDF version

1.23.x or earlier

Operating system

Linux

Python version

3.9

JorjMcKie commented 2 weeks ago

The example file for problem reproduction is missing!

JorjMcKie commented 2 weeks ago

If text extraction returns these strings, then they are there and it is no bug.
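For context: plain text extraction generally returns text in the order in which it is stored in the file, which need not match the visual reading order. Whether that explains this report is unknown, since no file was provided. The following is a minimal sketch of re-sorting extracted words by position, assuming word tuples in the layout returned by page.get_text("words"), i.e. (x0, y0, x1, y1, text, block_no, line_no, word_no). The helper name words_in_reading_order, the y_tolerance value, and the sample coordinates are hypothetical and only serve the illustration.

```python
from typing import List, Tuple

# Word tuple layout assumed to match PyMuPDF's page.get_text("words"):
# (x0, y0, x1, y1, text, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]


def words_in_reading_order(words: List[Word], y_tolerance: float = 3.0) -> List[str]:
    """Return word strings ordered top-to-bottom, then left-to-right.

    Words whose top coordinate (y0) lies within y_tolerance of the first
    word of the current row are treated as belonging to that row.
    """
    # Sort by top coordinate first so that rows can be grouped in one pass.
    ordered = sorted(words, key=lambda w: (w[1], w[0]))

    rows: List[List[Word]] = []
    for w in ordered:
        if rows and abs(w[1] - rows[-1][0][1]) <= y_tolerance:
            rows[-1].append(w)   # same visual row
        else:
            rows.append([w])     # start a new row

    result: List[str] = []
    for row in rows:
        # Within a row, order words by their left coordinate (x0).
        result.extend(w[4] for w in sorted(row, key=lambda w: w[0]))
    return result


# Hypothetical sample: three tags on one visual row, stored out of order.
sample = [
    (300.0, 100.5, 330.0, 112.0, "i:a:o", 0, 0, 0),
    (500.0, 100.0, 530.0, 112.0, "i:a:o", 0, 0, 1),
    (100.0, 101.0, 130.0, 112.0, "c:a:r", 1, 0, 0),
]
print(words_in_reading_order(sample))
```

With PyMuPDF installed, the input list could come from page.get_text("words"). The library also offers page.get_text(sort=True), which may be enough on its own when a simple top-left to bottom-right ordering is wanted.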

ashifaliclientpoint commented 2 weeks ago

Please allow me some time to arrange the example file. It is a customer document, so I need to arrange this.

Thanks

JorjMcKie commented 2 weeks ago

Please never submit an issue that we cannot reproduce based on its content. If you have confidential data that you cannot attach in the issue thread, you can instead send it to a maintainer's email address - e.g. mine. Then, confidentiality is guaranteed.

JorjMcKie commented 1 week ago

I have not received a reproducing file yet. Please be aware that we will close the issue tomorrow if we do not receive the required data.

JorjMcKie commented 1 week ago

Closed for lack of reproducing data.