mk-docenty commented 2 months ago

Description of the bug

Hi,

I am testing a PDF file and when I try to run it using pymupdf/fitz characters are broken and my pdf is encoded with /UniKS-UTF16-H For example this image is getting

Input :

output : 5356㱊ኂ⮮ᦂ# ⯆♮ⴖ# ⛯ኺ⊲ኞ⛚

Here is my code

`

try: import pymupdf as fitz # available with v1.24.3 except ImportError: import fitz import pathlib

Open the PDF document

doc = fitz.open("2023_결산서_제9장_성과보고서.pdf") output_dir = "output_markdown"

Create the output directory if it does not exist

pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)

Get the text from each page

for page_num in range(len(doc)): page = doc[page_num] text = page.get_text("text")

# Convert the text to UTF-8 (or keep as-is if already in desired encoding)
utf8_text = text.encode('utf-8') #encode('utf-8').decode('utf-16') #text.decode('utf-16').encode("utf-8")

# Define the output .md file name
outname = pathlib.Path(output_dir) / f"page_{page_num + 1}.md"

# Save the text to the .md file
pathlib.Path(outname).write_bytes(utf8_text)

print("Text has been successfully extracted and saved as .md files.")

` is there any solution for this?

How to reproduce the bug

My pymupdf version is 1.24.5 on macos with python 3.10

python test.py

PyMuPDF version

1.24.5

Operating system

MacOS

Python version

3.10

JorjMcKie commented 2 months ago

You only attached some image. We need the reproducing document.

mk-docenty commented 2 months ago

Hi,

here is example Pdf 2023_결산서_제9장_성과보고서-7-12.pdf

JorjMcKie commented 2 months ago

All PDF viewers / readers show garbled text output. So this is a problem with the file - not (Py-) MuPDF. All you could do is using OCR to make it readable.

mk-docenty commented 2 months ago

@JorjMcKie can I ask if I can apply CMap for pyMupdf? https://github.com/adobe-type-tools/cmap-resources/tree/master

mk-docenty commented 2 months ago

All PDF viewers / readers show garbled text output. So this is a problem with the file - not (Py-) MuPDF. All you could do is using OCR to make it readable.

Thank you for quick response

pymupdf / PyMuPDF

Charcaters are broken for Korean language pdf with /UniKS-UTF16-H encoding #3785

Description of the bug

Open the PDF document

Create the output directory if it does not exist

Get the text from each page

How to reproduce the bug

PyMuPDF version

Operating system

Python version