mk-docenty closed this issue 2 months ago
You only attached an image. We need the document that reproduces the problem.
Hi,
Here is an example PDF: 2023_결산서_제9장_성과보고서-7-12.pdf
All PDF viewers / readers show garbled text output. So this is a problem with the file, not with (Py)MuPDF. All you can do is use OCR to make it readable.
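As a rough sketch of the OCR route suggested above: PyMuPDF can build an OCR-based text page through `Page.get_textpage_ocr`, which assumes Tesseract and its Korean language data (`kor`) are installed. The helper names `looks_garbled` and `extract_with_ocr_fallback`, the 0.5 threshold, and the 300 dpi setting are illustrative assumptions, not part of PyMuPDF, and the Hangul check is only a heuristic for documents expected to contain Korean.

```python
def looks_garbled(text, threshold=0.5):
    """Heuristic: True if most non-ASCII characters fall outside Hangul blocks.

    Assumes the document is expected to contain Korean text.
    """
    non_ascii = [ch for ch in text if ord(ch) > 127 and not ch.isspace()]
    if not non_ascii:
        return False  # nothing to judge by
    hangul = sum(
        1
        for ch in non_ascii
        # Hangul Syllables and Hangul Compatibility Jamo
        if 0xAC00 <= ord(ch) <= 0xD7A3 or 0x3130 <= ord(ch) <= 0x318F
    )
    return hangul / len(non_ascii) < threshold


def extract_with_ocr_fallback(page, language="kor", dpi=300):
    """Return the page text, re-reading the page via OCR if it looks garbled."""
    text = page.get_text("text")
    if looks_garbled(text):
        # full=True OCRs the whole rendered page instead of only its images.
        textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
        text = page.get_text("text", textpage=textpage)
    return text
```

Here `page` is expected to be a PyMuPDF `Page`, for example `doc[0]` after opening the file. OCR quality depends on the rendering resolution and on the installed language data, so the result should be spot-checked against the visible page.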
@JorjMcKie, can I ask whether I can apply a CMap in PyMuPDF? https://github.com/adobe-type-tools/cmap-resources/tree/master
Thank you for the quick response.
Description of the bug
Hi,
I am testing a PDF file, and when I extract its text with pymupdf/fitz the characters come out broken. My PDF is encoded with /UniKS-UTF16-H. For example, this image produces the following output:
Input :
output : 5356㱊ኂ⮮ᦂ# ⯆♮ⴖ# ⛯ኺ⊲ኞ⛚
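To check what encoding the file actually declares for each font, a small diagnostic along these lines may help. It is a minimal sketch that assumes PyMuPDF's `Document.get_page_fonts` and `Document.xref_get_key`; the helper name `describe_page_fonts` is hypothetical. It only reports what the PDF declares and does not repair anything.

```python
def describe_page_fonts(doc, page_number=0):
    """List (basefont, type, encoding, has_tounicode) for the fonts on one page."""
    report = []
    for entry in doc.get_page_fonts(page_number):
        # Each entry starts with: xref, ext, type, basefont, name, encoding.
        xref, _ext, font_type, basefont, _name, encoding, *_rest = entry
        # xref_get_key returns a (type, value) pair; type is "null" if the key is absent.
        key_type, _value = doc.xref_get_key(xref, "ToUnicode")
        report.append((basefont, font_type, encoding, key_type != "null"))
    return report


# Usage (requires PyMuPDF and the PDF from this report):
# import pymupdf
# doc = pymupdf.open("2023_결산서_제9장_성과보고서.pdf")
# for row in describe_page_fonts(doc):
#     print(row)
```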
Here is my code:
```python
try:
    import pymupdf as fitz  # available with v1.24.3
except ImportError:
    import fitz
import pathlib

# Open the PDF document
doc = fitz.open("2023_결산서_제9장_성과보고서.pdf")
output_dir = "output_markdown"

# Create the output directory if it does not exist
pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)

# Get the text from each page
for page_num in range(len(doc)):
    page = doc[page_num]
    text = page.get_text("text")

print("Text has been successfully extracted and saved as .md files.")
```
Is there any solution for this?
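Note that the script as posted prints a success message but never writes any file. A minimal sketch of the missing save step could look like the following; the helper name `save_pages_as_markdown` and the `page_N.md` naming scheme are assumptions made for illustration, not taken from the original script.

```python
import pathlib


def save_pages_as_markdown(page_texts, output_dir="output_markdown"):
    """Write one UTF-8 .md file per page text and return the written paths."""
    out = pathlib.Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for number, text in enumerate(page_texts, start=1):
        target = out / f"page_{number}.md"  # hypothetical naming scheme
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

With the script above, this could be called as `save_pages_as_markdown(page.get_text("text") for page in doc)`. Saving does not change the extracted characters, so garbled text is written out garbled.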
How to reproduce the bug
My PyMuPDF version is 1.24.5, running on macOS with Python 3.10.
python test.py
PyMuPDF version
1.24.5
Operating system
MacOS
Python version
3.10