pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Editing CMap / ToUnicode to achieve correct character mapping when extracting text #530

Status: Closed (closed by cakemaker7 4 years ago)

cakemaker7 commented 4 years ago

First of all, thanks for developing and maintaining PyMuPDF. This is very helpful. I have the following problem: for some fonts in some PDFs, some characters cannot be extracted correctly because their CMap / ToUnicode mapping doesn't make sense or is incomplete (see also https://github.com/pymupdf/PyMuPDF/issues/365).

Using the process shown in https://github.com/pymupdf/PyMuPDF/issues/365 I can extract the mapping. Here is my question: assuming I know the correct character mapping, is there a way to edit the character mapping for the specific font in the PDF / Page object so that the text is extracted correctly?

For some situations, this would be a much better solution than using OCR. An alternative idea would be to somehow replace the respective characters (in the bytecode of the PDF?) before they are processed by MuPDF, which seems much more difficult.

Any help is appreciated. Thanks!

JorjMcKie commented 4 years ago

The character mapping is stored in a so-called "stream" object, i.e. a PDF object definition followed by bytes that are wrapped in the keywords "stream" and "endstream". Technically, pymupdf can not only read stream content but also replace it. So you could do this (note the new method aliases for this):

content = doc.xrefStream(xref)  # this is a bytes object!
# do something with content, then
doc.updateStream(xref, content)

Whether or not this really works to your satisfaction would be an extremely interesting experiment!

You obviously need to locate the font and dig your way through to the PDF object that actually contains the CMap. That shouldn't be too difficult though, because the object definition strings you extract with pymupdf are by default formatted in a predictable way by MuPDF, so parsing is easy enough.
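A minimal sketch of that parsing step, assuming the font's object definition contains an indirect reference of the form `/ToUnicode 12 0 R`. The helper name `find_tounicode_xref` and the sample object string are made up for illustration; in practice the string would be whatever pymupdf returns for the font's xref.

```python
import re
from typing import Optional

# Matches an indirect reference such as "/ToUnicode 12 0 R".
_TOUNICODE_REF = re.compile(r"/ToUnicode\s+(\d+)\s+(\d+)\s+R")


def find_tounicode_xref(font_object: str) -> Optional[int]:
    """Return the xref number of the ToUnicode stream, or None if absent."""
    match = _TOUNICODE_REF.search(font_object)
    return int(match.group(1)) if match else None


# Hypothetical object definition, shaped like a typical font dictionary.
sample = """<<
  /Type /Font
  /Subtype /Type0
  /BaseFont /ABCDEF+SomeFont
  /Encoding /Identity-H
  /ToUnicode 12 0 R
>>"""

print(find_tounicode_xref(sample))
```

A regular expression is used instead of splitting on lines so that the lookup does not depend on the key sitting on a line of its own.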

This change pertains to the font and is page-independent, so it would apply to all pages using that font. But of course you can use code analogous to the above to flip-flop between CMaps, I assume.

cakemaker7 commented 4 years ago

Thank you very much. I didn't expect it to be so simple to update the stream. I will try this and let you know how it works.

JorjMcKie commented 4 years ago

"I will try this and let you know how it works."

Please do so! The updateStream method will automatically compress (deflate) the stream if beneficial. Hope this won't interfere with anything, but I do not think so.

cakemaker7 commented 4 years ago

I am now using something like this:

import fitz  # PyMuPDF


def update_tounicode_mapping(doc: fitz.Document, pno: int, font: str, old_mappings, new_mappings):
    # Uses the camelCase method names current at the time of writing; later PyMuPDF
    # versions name them get_page_fonts, xref_object, xref_stream and update_stream.
    for font_tuple in doc.getPageFontList(pno):  # Could also be done using doc.loadPage(pno).getFontList()
        # Also match the name without an "AAAAAA+"-style prefix; [-1] is safe
        # for names that contain no "+" at all.
        if font in (font_tuple[3], font_tuple[3].split("+", maxsplit=1)[-1]):
            for line in doc.xrefObject(font_tuple[0]).splitlines():
                line = line.strip()
                if line.startswith("/ToUnicode"):
                    stream_id = int(line.split()[1])
                    old_stream_decoded = doc.xrefStream(stream_id).decode()
                    # Only the first beginbfchar ... endbfchar section is edited.
                    start = old_stream_decoded.find("beginbfchar")
                    end = old_stream_decoded.find("endbfchar", start) if start >= 0 else -1
                    if 0 <= start < end:
                        section = old_stream_decoded[start:end]
                        for old, new in zip(old_mappings, new_mappings):
                            # Limitation: Does not work if you want to replace a->b, b->c and c->a (or similar),
                            # because each replace() sees the result of the previous one.
                            # Wouldn't be a big deal to solve, but I don't need it right now.
                            section = section.replace(old, new)
                        new_stream_decoded = old_stream_decoded[:start] + section + old_stream_decoded[end:]
                        new_stream_encoded = new_stream_decoded.encode()
                        doc.updateStream(stream_id, new_stream_encoded)
                        break

In terms of coding style it's probably far from perfect, but so far it seems to work fine.
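The limitation noted in the code comment (chained or cyclic replacements such as a->b, b->c, c->a) could be lifted by replacing everything in a single pass. The following is only a sketch of that idea, not part of the function above: the helper name `replace_simultaneously` and the sample entries are invented for illustration.

```python
import re
from typing import Sequence


def replace_simultaneously(text: str, old: Sequence[str], new: Sequence[str]) -> str:
    """Replace every old[i] with new[i] in a single pass over the text."""
    table = dict(zip(old, new))
    if not table:
        return text
    # Longest alternatives first, so a shorter key cannot shadow a longer one.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Each match is looked up once; replaced text is never scanned again.
    return pattern.sub(lambda m: table[m.group(0)], text)


# Cyclic swap a->b, b->c, c->a on hypothetical bfchar target values.
section = "<0001> <0061>\n<0002> <0062>\n<0003> <0063>\n"
swapped = replace_simultaneously(
    section,
    ["<0061>", "<0062>", "<0063>"],
    ["<0062>", "<0063>", "<0061>"],
)
print(swapped)
```

As with plain `str.replace`, the pattern matches wherever the text occurs, so a source code that happens to look like one of the old values would be rewritten too.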

JorjMcKie commented 4 years ago

@cakemaker7 - and it does work? I am impressed. Is there anything where PyMuPDF's support could be improved? Can I persuade you to write a short Wiki article on this? I remember several issues / questions dealing with this type of thing ... some time ago though.

How do you know what to replace with what?

Comment on font names with 6-letter prefixes like "ABCDEF+": these indicate a character subset of the original font. Some PDF creator software (LibreOffice, MS Word) looks at the set of characters actually used and then stores a stripped-down version of the full font.
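For comparing font names regardless of such a prefix, a small helper along these lines might do. It is a sketch that assumes the prefix is exactly six uppercase letters followed by "+"; the name `strip_subset_prefix` is invented here.

```python
import re

# Six uppercase letters followed by "+", anchored at the start of the name.
_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def strip_subset_prefix(font_name: str) -> str:
    """Return the font name without a leading subset tag, if there is one."""
    return _SUBSET_PREFIX.sub("", font_name)


print(strip_subset_prefix("ABCDEF+Arial"))
print(strip_subset_prefix("Arial"))
```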

cakemaker7 commented 4 years ago

Yes, it works. I need it for only five different special characters, though.

I can think of three possible ways to find out the desired mapping, which may be more or less suitable depending on the case:

  1. Look at the current mapping and compare with the rendered PDF
  2. Trial and error: Change the mapping and observe the text extraction result
  3. Look at the bytecode of the relevant xref streams and make sense of it. With some practice (and potentially regex) it is possible to "read" the bytecode and identify the correct mapping by comparing with the rendered PDF.
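For the first and third approach it helps to see the current mapping as a plain table. The sketch below assumes the ToUnicode stream has already been decoded to text and only looks at `beginbfchar ... endbfchar` sections (`bfrange` sections are ignored). The function name `parse_bfchar` and the sample stream are made up for illustration.

```python
import re
from typing import Dict

_BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
_BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text: str) -> Dict[int, str]:
    """Map each source code to the text it is currently translated to."""
    mapping: Dict[int, str] = {}
    for block in _BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in _BFCHAR_ENTRY.findall(block):
            # The target is hex-encoded UTF-16BE; odd-length hex is not handled.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be", errors="replace")
    return mapping


# Hypothetical excerpt of a decoded ToUnicode stream.
sample = """
2 beginbfchar
<0003> <0020>
<0024> <0041>
endbfchar
"""

for code, text in parse_bfchar(sample).items():
    print(f"{code:04X} -> {text!r}")
```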

Regarding PyMuPDF's support: If this problem / case is relevant enough, a more sophisticated version of my function could be integrated into PyMuPDF. I don't know how many people are having such problems, but my guess would be that there are more important issues.

Sure, I can write a short wiki article if you give me a short hint about where and how.

grivanov commented 2 years ago

Hi all, did @cakemaker7 write that wiki article? Is there a demo/example to check out? I'm in a similar situation and don't know that much programming but probably can adapt an example to fit my needs.

Mowmowj commented 7 months ago

@cakemaker7 Hi, your approach looks really good. I have recently been building a PDF text extractor based on pdfjs and am facing the same issue (for some text the /ToUnicode CMap is incorrect, so part of the text fails to extract). I hope you can give me some pointers. Do you think it is possible to write a general script that automatically detects an incorrect ToUnicode CMap and fixes it?

xhivo97 commented 2 months ago

So, I had this issue and was able to add the missing ToUnicode entries.

If you're lucky and your file has the same encoding as mine, and as long as there wasn't any manual obfuscation, this should always work.

From 9.7.5.2 Predefined CMaps of the PDF specification:

When the current font is a Type 0 font whose Encoding entry is Identity-H or Identity-V, the string to
be shown shall contain pairs of bytes representing CIDs, high-order byte first. When the descendant
font of a Type 0 font is a Type 2 CIDFont in which the CIDToGIDMap entry is Identity and if the
TrueType font is embedded in the PDF file, the 2-byte CID values shall be identical glyph indices for the
glyph descriptions in the TrueType font program.

This means that, for each font resource that meets the above requirements, you need to map every glyph index of the embedded font back to its Unicode code point.

For example, if the embedded font file in question has the glyph for 0x20 (space) at index 3, then in every text that uses this font a 0x0003 will be a space, so the ToUnicode entry we need for that is: <0003> <0020>.
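Entries like `<0003> <0020>` can be generated from a glyph-index-to-code-point table. The following is a rough sketch under a few assumptions: the function name `build_bfchar_sections` is invented, the input dictionary is hypothetical, and sections are split at 100 entries because, as far as I know, the CMap format limits each `beginbfchar` block to that many.

```python
from typing import Dict, List


def build_bfchar_sections(glyph_to_codepoint: Dict[int, int], chunk_size: int = 100) -> str:
    """Render {glyph index: Unicode code point} as bfchar sections."""
    entries: List[str] = []
    for gid, codepoint in sorted(glyph_to_codepoint.items()):
        # Targets are UTF-16BE, so code points above U+FFFF become surrogate pairs.
        target = chr(codepoint).encode("utf-16-be").hex().upper()
        entries.append(f"<{gid:04X}> <{target}>")

    sections: List[str] = []
    for i in range(0, len(entries), chunk_size):
        chunk = entries[i:i + chunk_size]
        sections.append(f"{len(chunk)} beginbfchar\n" + "\n".join(chunk) + "\nendbfchar")
    return "\n".join(sections)


# Hypothetical mapping: glyph 3 is a space, glyph 0x24 is "A".
print(build_bfchar_sections({3: 0x20, 0x24: 0x41}))
```

This only produces the bfchar part; a complete ToUnicode CMap also needs the surrounding header and codespace range, which are not shown here.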

@Mowmowj A truly generic solution is almost certainly not worth the effort, and I recommend using the PDFBox Debugger to figure out the mapping and then applying it either at the extraction level or with a ToUnicode CMap.

Here's a minimal example of how to extract the mapping (does not include applying it to the PDF file):

import fitz
from fontTools.ttLib import TTFont
from io import BytesIO

def print_to_unicode_mapping(font: TTFont):
    # I only care about a unicode cmap (3, 1) no idea if something else works
    cmap = font.getBestCmap(((3, 1),))

    if cmap:
        # Several code points can share one glyph, so a glyph index may be
        # printed more than once. I don't know if there are cases where this
        # matters; for fixing copy pasting I guess the first one is enough.
        for code, name in cmap.items():
            # getGlyphID() gives the glyph's position in the font's glyph order
            glyph_index = font.getGlyphID(name)
            escaped = chr(code).encode("unicode_escape").decode("utf-8")
            print(f"char({escaped}): {glyph_index:04X}, {code:04X}")

# Important: make sure this works in your version of pymupdf
fitz.TOOLS.set_subset_fontnames(True)

doc = fitz.open("input.pdf")

seen = set()
fonts = []
for page_num in range(len(doc)):
    page = doc[page_num]
    for (xref, _, font_type, name, _, encoding) in page.get_fonts():

        # Insert code to validate that the font meets the encoding conditions
        # for this to work here

        # Someone let me know if this is correct or do I need to truly do this for
        # every font in every page?
        if name in seen:
            continue
        seen.add(name)

        # PDF source of the font object, handy for the validation above
        font_pdf_object = doc.xref_object(xref)
        font_file_bytes = doc.extract_font(xref)[3]
        if not font_file_bytes:
            continue  # no embedded font file to inspect

        fonts.append((name, TTFont(BytesIO(font_file_bytes))))

for name, font in fonts:
    print(name)
    print_to_unicode_mapping(font)
    print()
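The validation left open in the loop above could start from the tuples that `page.get_fonts()` yields. This is only a partial sketch: it assumes the third and sixth fields hold the font type and encoding as the strings "Type0" and "Identity-H" / "Identity-V", and it does not check the descendant font's CIDToGIDMap entry or whether the embedded font is really TrueType. The helper name `uses_identity_encoding` and the sample tuples are invented.

```python
from typing import Sequence


def uses_identity_encoding(font_entry: Sequence) -> bool:
    """Check the Type 0 / Identity-H or Identity-V part of the conditions."""
    # Field order assumed: xref, ext, type, basefont, name, encoding
    _xref, _ext, font_type, _basefont, _name, encoding = font_entry[:6]
    return font_type == "Type0" and encoding in ("Identity-H", "Identity-V")


# Hypothetical entries shaped like the tuples in the loop above.
candidates = [
    (12, "ttf", "Type0", "ABCDEF+SomeFont", "F1", "Identity-H"),
    (15, "n/a", "Type1", "Helvetica", "F2", "WinAnsiEncoding"),
]

for entry in candidates:
    print(entry[3], uses_identity_encoding(entry))
```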