pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Extracted text shows unicode character 65533 #365

Closed matter-funds closed 4 years ago

matter-funds commented 5 years ago

I am having issues extracting text from the following pdf: https://gofile.io/?c=TN6hln Apologies for the large pdf; I didn't want to modify the document and cut it down in case it messed up some metadata.

Anyway, on pdf page 136 the text is extracted as '�' (it renders OK, though).

>>> [[x['c'] for x in c['chars']] for c in self.doc[136].getDisplayList().getTextPage().extractRAWDICT()['blocks'][-1]['lines'][-1]['spans']]
[['�', '�', '�', '�', '�', '�']]

All of those are "Replacement Characters" with code 65533.
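For reference, 65533 is U+FFFD, the Unicode replacement character that decoders substitute for values they cannot map; a quick check in Python:

```python
# 65533 == 0xFFFD, the Unicode REPLACEMENT CHARACTER
print(hex(65533))              # 0xfffd
print(chr(65533) == "\ufffd")  # True
```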

From digging around, as far as I can tell, this is because of the Identity-H mapping in the PDF FontList:

>>> pd.DataFrame(self.doc[136].getFontList()).sort_values(3)
       0    1         2                                 3     4                5
12  1605  ttf  TrueType         JUBNDS+FrutigerLTCom-Bold   TT1  WinAnsiEncoding
11  1601  ttf  TrueType        JUBNDS+FrutigerLTCom-Light   TT0  WinAnsiEncoding
10  1612  cff     Type1               JUBNDS+Myriad-Roman  T1_0  WinAnsiEncoding
2   1611  ttf     Type0         WIQNDS+FrutigerLTCom-Bold  C2_2       Identity-H
6   1614  ttf     Type0         WIQNDS+FrutigerLTCom-Bold  C2_6       Identity-H
5   1618  ttf     Type0       WIQNDS+FrutigerLTCom-BoldCn  C2_5       Identity-H
9   1619  ttf     Type0       WIQNDS+FrutigerLTCom-BoldCn  C2_9       Identity-H
4   1613  ttf     Type0      WIQNDS+FrutigerLTCom-LightCn  C2_4       Identity-H
8   1615  ttf     Type0      WIQNDS+FrutigerLTCom-LightCn  C2_8       Identity-H
3   1617  ttf     Type0   WIQNDS+FrutigerLTCom-LightCnIta  C2_3       Identity-H
7   1616  ttf     Type0   WIQNDS+FrutigerLTCom-LightCnIta  C2_7       Identity-H
1   1610  ttf     Type0  WIQNDS+FrutigerLTCom-LightItalic  C2_1       Identity-H
0   1602  ttf     Type0        XQATRQ+FrutigerLTCom-Light  C2_0       Identity-H

From what I understand there are 2 possibilities here:

  1. The PDF owner has not provided a ToUnicode mapping, so the binary values in the pdf cannot be converted to unicode. In that case, the only way to extract the text is to provide your own mapping.
  2. There is a ToUnicode mapping, but mupdf isn't taking it into account when it is extracting the text.

My questions would be:

  1. Is there a way to extract the ToUnicode map? (or figure out if it is missing)
  2. Is there a way of getting the non-encoded binary text data from the PDF? That way, I can do the encoding on my end with, perhaps, a custom ToUnicode map.

Related discussion that ended with won't fix: https://github.com/pymupdf/PyMuPDF/issues/87

JorjMcKie commented 5 years ago

Interesting questions. First, some information on how getText("rawdict") character dictionaries get their "c" key filled, starting with v1.16.2:

uchar = PyUnicode_FromFormat("%c", ch->c);

This is a Python C API function which converts "a single character, represented as a C int" into unicode. The integer ch->c is provided by MuPDF. Once Python has determined that the integer is not convertible into a unicode character, there is obviously no way of knowing its value. Maybe a variant of the "rawdict" method would help, which returns the integer instead of the unicode character (or in addition to it ...). That might answer this one of your questions:

Is there a way of getting the non-encoded binary text data from the PDF? That way, I can do the encoding on my end with, perhaps, a custom ToUnicode map.

To your other question

  1. There is no direct way to extract the ToUnicode mapping. But you can interactively dig your way to its content like so:
>>> for f in page.getFontList(): print(f)

[209, 'ttf', 'Type0', 'SKKIBB+CambriaMath', 'R209', 'Identity-H']   % <=== let's take this one
[30, 'cff', 'Type1', 'WNMVNH+Times-Roman', 'R30', 'WinAnsiEncoding']
[15, 'ttf', 'TrueType', 'SFDGKS+Arial,Bold', 'R15', '']
[207, 'ttf', 'TrueType', 'AOJLUJ+CambriaMath', 'R207', '']
[76, 'cff', 'Type1', 'QRXZUH+Times-Italic', 'R76', 'WinAnsiEncoding']
[10, 'cff', 'Type1', 'DQRCUU+Times-Bold', 'R10', 'WinAnsiEncoding']
>>> print(doc._getXrefString(209))
<<
  /BaseFont /SKKIBB+CambriaMath
  /ToUnicode 774 0 R    % the xref number of the ToUnicode table
  /Type /Font
  /Encoding /Identity-H
  /DescendantFonts [ 210 0 R ]
  /Subtype /Type0
>>
>>> print(doc._getXrefString(774))    # print the PDF object definition
<<
  /Filter /FlateDecode
  /Length 326
>>
>>> cont = doc._getXrefStream(774)    # it is a stream object, so get its data
>>> print(cont.decode())    # convert Python bytes to string
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R774 def
1 begincodespacerange
<0000><ffff>
endcodespacerange
27 beginbfrange
<06ae><06ae><22ef>
<0723><0723><dc34>
<0725><0725><dc40>
<0727><0727><dc40>
<072a><072a><dc3b>
<072c><072c><dc3d>
<072d><072d><dc47>
<072e><072f><dc3f>
<0732><0732><dc47>
<0734><0734><dc45>
<0736><0736><dc47>
<0738><073a><dc49>
<073d><073d><dc3d>
<0744><0744><210e>
<0745><0745><dc56>
<0748><0748><dc3d>
<0749><0749><dc3d>
<074a><074a><dc5b>
<0751><0751><dc3d>
<0754><0754><dc65>
<07da><07da><defd>
<0879><0879><dc74>
<087c><087c><dc7d>
<087e><087e><dc74>
<0882><0882><dc7d>
<0892><0892><dc7d>
<089e><089e><dc74>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end

>>> 
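The bfrange triples above can be turned into a CID-to-unicode map with a few lines. This is a toy sketch that only handles the simple `<lo><hi><dst>` form shown in the dump; real CMaps also allow an array form of bfrange, plus bfchar sections, which it ignores:

```python
import re

def parse_bfranges(cmap_text):
    """Parse simple <lo><hi><dst> bfrange triples into a CID -> unicode map."""
    mapping = {}
    triple = re.compile(r"<([0-9a-fA-F]+)><([0-9a-fA-F]+)><([0-9a-fA-F]+)>")
    for lo, hi, dst in triple.findall(cmap_text):
        lo_i, hi_i, dst_i = int(lo, 16), int(hi, 16), int(dst, 16)
        for offset in range(hi_i - lo_i + 1):
            mapping[lo_i + offset] = chr(dst_i + offset)
    return mapping

sample = "<06ae><06ae><22ef>\n<072e><072f><dc3f>"
m = parse_bfranges(sample)
print(hex(ord(m[0x06AE])))        # 0x22ef
print(sorted(hex(k) for k in m))  # ['0x6ae', '0x72e', '0x72f']
```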
matter-funds commented 5 years ago

First, some information on how getText("rawdict") character dictionaries get their "c" key filled, starting with v1.16.2

Sorry, that's on me; I should've said I was on v1.16.1. Is this materially different from what I get by doing doc.getDisplayList().getTextPage().extractRAWDICT() in v1.16.1?

Maybe a variant of the "rawdict" method would help which returns the integer instead of the unicode character.

Yup, that would be a nice convenience method I could use together with a (possibly custom) ToUnicode mapping. Alternatively, I don't mind getting my hands dirty and use _getXrefString, if I can use it to extract the text binary (I'm not intimately familiar with the inner workings of PDFs, although I've read parts of the Adobe PDF Manual).

As for extracting the ToUnicode stuff, thanks for showing me how to use _getXrefString, that's very helpful. I'll have to play around with it for a bit, but it looks like a good starting point.

JorjMcKie commented 5 years ago

I have just experimented a little. You can always do this:

>>> for b in page.getText("rawdict")["blocks"]:
    for l in b["lines"]:
        for s in l["spans"]:
            for char in s["chars"]: print(ord(char["c"]))

This will print the integer behind the unicode char["c"] instead of the character itself ...

JorjMcKie commented 5 years ago

Sorry, that's on me; I should've said I was on v1.16.1. Is this materially different from what I get by doing doc.getDisplayList().getTextPage().extractRAWDICT() in v1.16.1?

I believe I made that change to PyUnicode_FromFormat("%c", ch->c); in version 1.16.2. Otherwise, page.getText("rawdict") is just a shortcut that makes creating the DisplayList / TextPage yourself unnecessary (it is still done under the hood, of course).

JorjMcKie commented 5 years ago

Yup, that would be a nice convenience method I could use together with a (possibly custom) ToUnicode mapping.

Using ord(char["c"]) when interpreting the character dictionary makes my earlier idea obsolete: it is already what we want. Even doing a print(char) for a char in span["chars"] will already (as of v1.16.2) show something like

{'origin': (223.9169921875, 674.6399536132812),
'bbox': (223.9169921875, 666.78955078125, 232.49581909179688, 676.87939453125),
'c': '\udc40'}

for a non-printable character, which corresponds to the integer 0xDC40 in this case.
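That '\udc40' is a lone UTF-16 surrogate code point, which is why it cannot be printed or encoded as UTF-8; a quick check:

```python
c = "\udc40"
print(hex(ord(c)))  # 0xdc40
try:
    c.encode("utf-8")
except UnicodeEncodeError as err:
    print("not encodable:", err.reason)  # surrogates not allowed
```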

matter-funds commented 4 years ago

I've had another look at this - I don't think doing ord(char['c']) is going to work in this case. The issue is that the character is set to unicode 65533 somewhere upstream, either in mupdf or pymupdf code.

When running with 1.16.2, on the pdf I provided:

>>> pdfp=136
>>> ord(self.doc[pdfp].getText('rawdict')['blocks'][-1]['lines'][-1]['spans'][-1]['chars'][-1]['c'])
65533
>>> ord(self.doc[pdfp].getText('rawdict')['blocks'][-1]['lines'][-1]['spans'][-1]['chars'][-2]['c'])
65533
>>> ord(self.doc[pdfp].getText('rawdict')['blocks'][-1]['lines'][-1]['spans'][-1]['chars'][-3]['c'])
65533

So the last 3 characters are all 65533. (All question mark characters are at that codepoint, really)

So you're right that ord(c) will give the unicode for that character, but the problem is rawdict doesn't contain the pdf codepoint - it gets set to 65533. Again, this is probably because either the internal ToUnicode map is missing or it's ignored.

JorjMcKie commented 4 years ago

I see. I'll check whether a translation of ch->c to 65533 happens in PyUnicode_FromFormat("%c", ch->c). If yes, then returning the integer ch->c will indeed help. If no, we are stuffed. The font used in your case from the last post seems to be C2_8, which points to xref 1615. This font has no ToUnicode array, BTW. I extracted the page's /Contents stream and looked at the last text output commands.

Short comment: MuPDF text extraction always leaves text in the sequence specified there, so the last char items you are pointing to should come from these positions.

The bottom of this 40KB stream looks like this (line breaks are mine). The Tf command selects a font (here: C2_8, fontsize 1), which remains in effect until overruled. This is followed by a Tm matrix, which in this case translates and scales (factor 6.5 in both directions).

...
/C2_8 1 Tf 6.5 0 0 6.5 319.0776 392.3218 Tm 
<001300120012>Tj
0.48 -3.538 Td <00170012>Tj 
0.48 -3.538 Td <0012>Tj
-0.998 -3.538 Td <000A00170012000B>Tj
-0.48 -3.538 Td <000A001300120012000B>Tj
T* <000A001300170012000B>Tj 
T* <000A001400120012000B>Tj 
-0.6 -3.538 Td <001E000A001400120012000B>Tj  % you reported last 3 chars from here
ET 
< end of stream >

So, your 3 characters reported as 65533 should be the last 3 characters of the last Tj command, i.e. 00120012000B. The best ever achievable would be a mechanism which returns the integers 18, 18, 11 here, right? What really happens when those numbers are turned into the glyphs we actually see is buried in the font's program at xref 1085 (specified as /FontFile2) and cannot be traced (by anything I know of).
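Since Identity-H maps each 2-byte value directly to a glyph index, that last Tj hex string can be split into its integers with a few lines (a sketch):

```python
def tj_hex_to_cids(hex_str):
    """Split an Identity-H Tj hex string into 2-byte glyph/CID integers."""
    data = bytes.fromhex(hex_str)
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

print(tj_hex_to_cids("001E000A001400120012000B"))  # [30, 10, 20, 18, 18, 11]
```

The last three values are indeed 18, 18, 11.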

But let's see, if ch->c is telling us something ...

matter-funds commented 4 years ago

I'll check if there happens a translation of ch->c to 65533 in PyUnicode_FromFormat("%c", ch->c).

I'd expect the issue to come from there. To be fair, it's probably following spec, as the values in the pdf are mapping into some garbage unicode space.

This font has no ToUnicode array BTW.

Tragically so - I noticed that as I was digging around with _getXrefString.

To be honest, I'll probably end up extracting the pixmap and using some OCR to get the text. It's way slower, but more robust.

That being said, thanks for looking into the belly of the PDF.

The best ever achievable, would be a mechanism which returns the integers 18, 18, 11 here, right?

That's correct - although now that I think about it, without the ToUnicode mapping, this information might be of limited value.

Another related point is that RAWDICT returns a shorter form of the font basename (I don't know what the right terminology is). As an example, the font list for page 136:

>>> pd.DataFrame(self.doc[136].getFontList(), columns = ['xref', 'ext', 'type', 'basefont', 'name', 'encoding'])
    xref  ext      type                          basefont  name         encoding
0   1602  ttf     Type0        XQATRQ+FrutigerLTCom-Light  C2_0       Identity-H
1   1610  ttf     Type0  WIQNDS+FrutigerLTCom-LightItalic  C2_1       Identity-H
2   1611  ttf     Type0         WIQNDS+FrutigerLTCom-Bold  C2_2       Identity-H
3   1617  ttf     Type0   WIQNDS+FrutigerLTCom-LightCnIta  C2_3       Identity-H
4   1613  ttf     Type0      WIQNDS+FrutigerLTCom-LightCn  C2_4       Identity-H
5   1618  ttf     Type0       WIQNDS+FrutigerLTCom-BoldCn  C2_5       Identity-H
6   1614  ttf     Type0         WIQNDS+FrutigerLTCom-Bold  C2_6       Identity-H
7   1616  ttf     Type0   WIQNDS+FrutigerLTCom-LightCnIta  C2_7       Identity-H
8   1615  ttf     Type0      WIQNDS+FrutigerLTCom-LightCn  C2_8       Identity-H
9   1619  ttf     Type0       WIQNDS+FrutigerLTCom-BoldCn  C2_9       Identity-H
10  1612  cff     Type1               JUBNDS+Myriad-Roman  T1_0  WinAnsiEncoding
11  1601  ttf  TrueType        JUBNDS+FrutigerLTCom-Light   TT0  WinAnsiEncoding
12  1605  ttf  TrueType         JUBNDS+FrutigerLTCom-Bold   TT1  WinAnsiEncoding

The rawdict will contain font names like FrutigerLTCom-Light, so you can't map them back into the right fontlist entry - is it XQATRQ+FrutigerLTCom-Light or JUBNDS+FrutigerLTCom-Light? They have different xrefs, so I assume they're different font objects.

Again, I think this is mostly academic at this point (going via the OCR route makes more sense in my case), but thought I should mention it.

Some tangential thoughts: after looking at some more PDFs I have to work with, this seems to be a whole class of errors where the text can't easily be extracted from the document. No reader seems able to handle these cases; copy-pasting text from the pdf results in lots of '?'.

I think one strategy to deal with these cases is:

for each unmappable character in the pdf:
  render the character as image
  use ocr to identify the character
  map that character to the correct unicode point

It's an expensive operation (and not 100% accurate), but:

  1. it's a one off for a document
  2. it would beat seeing '?'.

Anyway, thanks for the help with this - on my side I'll just take the easy way out and OCR the pages. Adding the suggested API (returning the original pdf codepoints) might be helpful to other people in the future, if you decide to add it.

JorjMcKie commented 4 years ago

Hmmm - no success. I'm getting this:

{'origin': (311.81060791015625, 562.3582763671875), 'bbox': (311.81060791015625, 557.4832763671875, 315.7106018066406, 563.9832763671875), 'c': '�', 'code': 65533}

So the value 65533 of ch->c is determined earlier in MuPDF and we have no way to see the HOWs and WHYs. I have marked the text of your example characters: a block with 1 line, line with 1 span of 6 characters. Drawing a rectangle around that line's bbox shows this:

[screenshot: the rendered line with a rectangle drawn around its bbox]

JorjMcKie commented 4 years ago

After looking at some more PDFs I have to work with, this seems to be a whole class of errors where the text can't be easily extracted from the document. No reader seems to be able to handle these cases; copy pasting text from the pdf results in lots of '?'.

This may be a deliberate way to prevent exactly this copy / paste procedure: only the font file's program knows how to do the translation. I have seen a couple of efforts in that direction.

JorjMcKie commented 4 years ago

The rawdict will contain font names like FrutigerLTCom-Light, so you can't map them back into the right fontlist entry - is it XQATRQ+FrutigerLTCom-Light or JUBNDS+FrutigerLTCom-Light? They have different xrefs, so I assume they're different font objects.

Yes, some fonts are containers of subfonts, which can be referenced in that manner. Doing so helps limit the PDF file size - potentially a big space saver. As this part of the code (everything except html, xhtml and xml) is my own making, it would be easy to return the full font name: currently I am stripping off any prefix delimited by "+", in an effort to follow MuPDF's logic here. I am willing to follow a request for the full name instead ... ;-)
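Splitting off (or keeping) the subset tag before the "+" is a one-liner; a sketch:

```python
def split_subset_tag(basefont):
    """Split 'XQATRQ+FrutigerLTCom-Light' into (subset tag, base name)."""
    tag, sep, name = basefont.partition("+")
    return (tag, name) if sep else (None, basefont)

print(split_subset_tag("XQATRQ+FrutigerLTCom-Light"))  # ('XQATRQ', 'FrutigerLTCom-Light')
print(split_subset_tag("Helvetica"))                   # (None, 'Helvetica')
```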

matter-funds commented 4 years ago

Noted - my experience with pymupdf is currently limited, but I'm sure I'll get more intimate with it over time.

JorjMcKie commented 4 years ago

Welcome on board anyway, hope you will enjoy the package.

JorjMcKie commented 3 years ago

@victor-ab - This is the approach:

  1. extract the page's text via page.get_text("rawdict", flags=0), which will give a hierarchy of Python dictionaries
  2. whenever a character (lowest hierarchy level) has code 65533, make a pixmap from its bbox
  3. use OCR to interpret the pixmap into a unicode and take that value to replace 65533.
mat = fitz.Matrix(2, 2)  # potentially use to magnify / improve char image
for b in page.get_text("rawdict", flags=0)["blocks"]:  # flags value excludes any images on page
    for l in b["lines"]:
        for s in l["spans"]:
            for char in s["chars"]:
                if char["c"] == 65533:
                    pix = page.get_pixmap(matrix=mat, clip=char["bbox"])
                    # call some OCR magic with 'pix' to receive recovered unicode unc
                    char["c"] = unc

The open point is the 'OCR magic'! The rest is more of a no-brainer.

With v1.18.0, MuPDF has introduced integrated support for Tesseract. I have not yet extended PyMuPDF to support this. Primary reasons:

  1. other priorities (issue resolutions, ...)
  2. the effort:
    • extension to PyMuPDF's API: parallel methods for all (some?) text extractions
    • unclear integration with Tesseract, which could (or should?!) already have been installed on the target system - including a range of detectable / supported languages, Tesseract training data and what not. How to locate it / include some configuration features in PyMuPDF, ...

So for the time being, a somewhat clumsy way out may be to check whether a page has at least one character code 65533. If yes, hand the respective page to an outside subprocess, which executes pre-installed Tesseract with it. Then extract the text of the returned OCR-ed page ...
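That fallback could look roughly like this (a sketch only: it assumes a tesseract binary on the PATH and a page already rendered to a PNG file; the names are illustrative):

```python
import subprocess

REPLACEMENT = chr(0xFFFD)  # the integer 65533

def needs_ocr(page_text):
    """True if extracted text contains at least one replacement character."""
    return REPLACEMENT in page_text

def ocr_png(png_path):
    """Run a pre-installed Tesseract on a rendered page image and return its text."""
    base = png_path[:-len(".png")]  # tesseract appends ".txt" to the output base
    subprocess.run(["tesseract", png_path, base], check=True)
    with open(base + ".txt", encoding="utf-8") as f:
        return f.read()
```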

victor-ab commented 3 years ago

Hey, I am sorry, I deleted my comment here and moved it to discussion

JorjMcKie commented 3 years ago

no problem - saw it there.

smokersan1 commented 2 years ago

(quoting JorjMcKie's rawdict + per-character OCR approach from the comment above)

This method is perfect for my needs, however I cannot save the OCRed character. The dictionary doesn't update:

>>> print(page_test.get_text("rawdict", flags=0)["blocks"][0]['lines'][0]['spans'][0]['chars'][1]['c'])
र
>>> page_test.get_text("rawdict", flags=0)["blocks"][0]['lines'][0]['spans'][0]['chars'][1]['c'] = 'A'
>>> print(page_test.get_text("rawdict", flags=0)["blocks"][0]['lines'][0]['spans'][0]['chars'][1]['c'])
र

JorjMcKie commented 2 years ago

@smokersan1 - a joke?
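Each get_text call builds a fresh Python dictionary, so assigning into one call's result cannot change what the next call returns. Keep a single result and mutate that copy; a minimal sketch with a stand-in for the "rawdict" structure:

```python
# Stand-in for one call's page.get_text("rawdict") result (illustrative data).
raw = {"blocks": [{"lines": [{"spans": [{"chars": [{"c": "र"}]}]}]}]}

# Mutating the dictionary we kept works ...
char = raw["blocks"][0]["lines"][0]["spans"][0]["chars"][0]
char["c"] = "A"
print(raw["blocks"][0]["lines"][0]["spans"][0]["chars"][0]["c"])  # A
# ... whereas a new get_text() call would rebuild the original data.
```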