Closed: matter-funds closed this issue 4 years ago
Interesting questions.
First, some information about how getText("rawdict")
character dictionaries get their "c"
key filled, starting with v1.16.2:
uchar = PyUnicode_FromFormat("%c", ch->c);
This is a Python C-API function which converts "A single character, represented as a C int" into unicode. The integer ch->c
is provided by MuPDF. Once Python has determined that this integer is not convertible into a unicode character, there is obviously no way of knowing its original value.
Maybe a variant of the "rawdict" method would help, which returns the integer instead of the unicode character (or in addition to it ...). Maybe this would answer one of your questions:
Is there a way of getting the non-encoded binary text data from the PDF? That way, I can do the encoding on my end with, perhaps, a custom ToUnicode map.
To your other question
>>> for f in page.getFontList(): print(f)
[209, 'ttf', 'Type0', 'SKKIBB+CambriaMath', 'R209', 'Identity-H'] % <=== let's take this one
[30, 'cff', 'Type1', 'WNMVNH+Times-Roman', 'R30', 'WinAnsiEncoding']
[15, 'ttf', 'TrueType', 'SFDGKS+Arial,Bold', 'R15', '']
[207, 'ttf', 'TrueType', 'AOJLUJ+CambriaMath', 'R207', '']
[76, 'cff', 'Type1', 'QRXZUH+Times-Italic', 'R76', 'WinAnsiEncoding']
[10, 'cff', 'Type1', 'DQRCUU+Times-Bold', 'R10', 'WinAnsiEncoding']
>>> print(doc._getXrefString(209))
<<
/BaseFont /SKKIBB+CambriaMath
/ToUnicode 774 0 R % the xref number of the ToUnicode table
/Type /Font
/Encoding /Identity-H
/DescendantFonts [ 210 0 R ]
/Subtype /Type0
>>
>>> print(doc._getXrefString(774)) # print the PDF object definition
<<
/Filter /FlateDecode
/Length 326
>>
>>> cont = doc._getXrefStream(774) # it is a stream object, so get its data
>>> print(cont.decode()) # convert Python bytes to string
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R774 def
1 begincodespacerange
<0000><ffff>
endcodespacerange
27 beginbfrange
<06ae><06ae><22ef>
<0723><0723><dc34>
<0725><0725><dc40>
<0727><0727><dc40>
<072a><072a><dc3b>
<072c><072c><dc3d>
<072d><072d><dc47>
<072e><072f><dc3f>
<0732><0732><dc47>
<0734><0734><dc45>
<0736><0736><dc47>
<0738><073a><dc49>
<073d><073d><dc3d>
<0744><0744><210e>
<0745><0745><dc56>
<0748><0748><dc3d>
<0749><0749><dc3d>
<074a><074a><dc5b>
<0751><0751><dc3d>
<0754><0754><dc65>
<07da><07da><defd>
<0879><0879><dc74>
<087c><087c><dc7d>
<087e><087e><dc74>
<0882><0882><dc7d>
<0892><0892><dc7d>
<089e><089e><dc74>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
>>>
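For programmatic use, the bfrange entries in a dump like the one above can be turned into a lookup table. A minimal sketch (the function name is my own); it handles only the simple <lo><hi><dst> form shown here, not bfchar entries or array destinations:

```python
import re

def parse_bfranges(cmap_text):
    """Build a {CID: unicode} dict from the beginbfrange section of a ToUnicode CMap."""
    mapping = {}
    in_range = False
    for line in cmap_text.splitlines():
        line = line.strip()
        if line.endswith("beginbfrange"):
            in_range = True
        elif line == "endbfrange":
            in_range = False
        elif in_range:
            m = re.match(r"<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>", line)
            if m:
                lo, hi, dst = (int(g, 16) for g in m.groups())
                for cid in range(lo, hi + 1):
                    # destination codes increment along with the source range
                    mapping[cid] = chr(dst + cid - lo)
    return mapping
```

For the table above, parse_bfranges(cont.decode())[0x06AE] would give '\u22ef'.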
First an information how getText("rawdict") character dictionaries are getting their "c" key filled starting with v1.16.2
Sorry, that's on me, I should've said I was on v1.16.1. Is this materially different from what I get by doing
doc.getDisplayList().getTextPage().extractRAWDICT()
in v1.16.1?
Maybe a variant of the "rawdict" method would help which returns the integer instead of the unicode character.
Yup, that would be a nice convenience method I could use together with a (possibly custom) ToUnicode mapping. Alternatively, I don't mind getting my hands dirty and use _getXrefString, if I can use it to extract the text binary (I'm not intimately familiar with the inner workings of PDFs, although I've read parts of the Adobe PDF Manual).
As for extracting the ToUnicode stuff, thanks for showing me how to use _getXrefString, that's very helpful. I'll have to play around with it for a bit, but it looks like a good starting point.
I have just experimented a little. You can always do this:
>>> for b in page.getText("rawdict")["blocks"]:
...     for l in b["lines"]:
...         for s in l["spans"]:
...             for char in s["chars"]: print(ord(char["c"]))
This will print the integer behind the unicode char["c"]
instead of the character itself ...
Sorry, that's on me, I should've said I was on v1.16.1. Is this materially different to what I get with doing doc.getDisplayList().getTextPage().extractRAWDICT() in v1.16.1?
I believe I made that change to PyUnicode_FromFormat("%c", ch->c)
in version 1.16.2.
Otherwise, page.getText("rawdict")
is just a shortcut which saves you from creating the DisplayList / TextPage yourself (this is still done under the hood, of course).
Yup, that would be a nice convenience method I could use together with a (possibly custom) ToUnicode mapping.
Using ord(char["c"])
when interpreting the character dictionary makes my idea from before obsolete: it already delivers what we want. Even doing a print(char)
for a char
in span["chars"]
will already (as of v1.16.2) show something like
{'origin': (223.9169921875, 674.6399536132812),
'bbox': (223.9169921875, 666.78955078125, 232.49581909179688, 676.87939453125),
'c': '\udc40'}
for a non-printable character. Which corresponds to the integer 0xDC40 in this case.
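In other words, ord() recovers that integer directly. A tiny check, using the value from the dict above:

```python
char = {"c": "\udc40"}          # lone low surrogate, as in the "rawdict" output above
code = ord(char["c"])
assert code == 0xDC40 == 56384  # the integer behind the unprintable character
```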
I've had another look at this - I don't think doing ord(char['c'])
is going to work in this case. The issue is that the character is set to unicode 65533 somewhere upstream, either in mupdf or pymupdf code.
When running with 1.16.2, on the pdf I provided:
>>> pdfp=136
>>> ord(self.doc[pdfp].getText('rawdict')['blocks'][-1]['lines'][-1]['spans'][-1]['chars'][-1]['c'])
65533
>>> ord(self.doc[pdfp].getText('rawdict')['blocks'][-1]['lines'][-1]['spans'][-1]['chars'][-2]['c'])
65533
>>> ord(self.doc[pdfp].getText('rawdict')['blocks'][-1]['lines'][-1]['spans'][-1]['chars'][-3]['c'])
65533
So the last 3 characters are all 65533. (All question mark characters are at that codepoint, really)
So you're right that ord(c)
will give the unicode for that character, but the problem is rawdict doesn't contain the pdf codepoint - it gets set to 65533. Again, this is probably because either the internal ToUnicode map is missing or it's ignored.
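To quantify how badly a page is affected, the extracted dictionary can be scanned for U+FFFD (65533). A small helper (my own, written against the plain dict structure so it needs no PyMuPDF object at hand):

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER, ord() == 65533

def count_unmapped(rawdict):
    """Count replacement characters in a page.getText('rawdict') result."""
    return sum(
        1
        for b in rawdict["blocks"]
        for l in b.get("lines", [])   # image blocks carry no "lines" key
        for s in l["spans"]
        for ch in s["chars"]
        if ch["c"] == REPLACEMENT
    )
```

e.g. count_unmapped(doc[136].getText("rawdict")) for the page discussed here.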
I see. I'll check whether a translation of ch->c
to 65533 happens inside PyUnicode_FromFormat("%c", ch->c)
.
If yes, then returning the integer ch->c
will indeed help. If no, we are stuffed.
The font used in your case from the last post seems to be C2_8
, which points to xref 1615. This font has no ToUnicode array BTW.
I extracted the page's /Contents stream and looked at the last text output commands.
Short comment: MuPDF text extraction always leaves text in the sequence specified there, so the last char items you pointed to should come from these positions.
The bottom of this 40KB stream looks like this (line breaks are mine). The Tf
command references a font (here: C2_8, font size 1), which remains in effect until overruled. This is followed by a Tm
matrix which in this case translates and scales (factor 6.5 in both directions).
...
/C2_8 1 Tf 6.5 0 0 6.5 319.0776 392.3218 Tm
<001300120012>Tj
0.48 -3.538 Td <00170012>Tj
0.48 -3.538 Td <0012>Tj
-0.998 -3.538 Td <000A00170012000B>Tj
-0.48 -3.538 Td <000A001300120012000B>Tj
T* <000A001300170012000B>Tj
T* <000A001400120012000B>Tj
-0.6 -3.538 Td <001E000A001400120012000B>Tj % you reported last 3 chars from here
ET
< end of stream >
So, your 3 characters reported as 65533 should be the last 3 characters of the last Tj
command, i.e. 00120012000B.
The best ever achievable would be a mechanism which returns the integers 18, 18, 11 here (0x0012 = 18, 0x000B = 11), right?
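Those integers fall straight out of the hex strings: with Identity-H encoding, a Tj hex string is just a sequence of 2-byte character IDs. A small decoder (helper name is my own):

```python
def tj_hex_to_cids(hexstr):
    """Split an Identity-H Tj hex string into its 2-byte character IDs."""
    data = bytes.fromhex(hexstr)
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
```

e.g. tj_hex_to_cids("00120012000B") returns [18, 18, 11].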
What really happens when those numbers are turned into the glyphs we are actually seeing, is buried in the font's program at xref 1085 (specified as /FontFile2) and cannot be traced (by anything I know of).
But let's see if ch->c
is telling us something ...
I'll check if there happens a translation of ch->c to 65533 in PyUnicode_FromFormat("%c", ch->c).
I'd expect the issue to come from there. To be fair, it's probably following spec, as the values in the pdf are mapping into some garbage unicode space.
This font has no ToUnicode array BTW.
Tragically so - I noticed that as I was digging around with _getXrefString.
To be honest, I'll probably end up extracting the pixmap and using some OCR to get the text. It's way slower, but more robust.
That being said, thanks for looking into the belly of the PDF.
The best ever achievable, would be a mechanism which returns the integers 18, 18, 11 here, right?
That's correct - although now that I think about it, without the ToUnicode mapping, this information might be of limited value.
Another related point is that RAWDICT returns a shorter form of the font basename (I don't know what the right terminology is). As an example, the font list for page 136:
>>> pd.DataFrame(self.doc[136].getFontList(), columns = ['xref', 'ext', 'type', 'basefont', 'name', 'encoding'])
xref ext type basefont name encoding
0 1602 ttf Type0 XQATRQ+FrutigerLTCom-Light C2_0 Identity-H
1 1610 ttf Type0 WIQNDS+FrutigerLTCom-LightItalic C2_1 Identity-H
2 1611 ttf Type0 WIQNDS+FrutigerLTCom-Bold C2_2 Identity-H
3 1617 ttf Type0 WIQNDS+FrutigerLTCom-LightCnIta C2_3 Identity-H
4 1613 ttf Type0 WIQNDS+FrutigerLTCom-LightCn C2_4 Identity-H
5 1618 ttf Type0 WIQNDS+FrutigerLTCom-BoldCn C2_5 Identity-H
6 1614 ttf Type0 WIQNDS+FrutigerLTCom-Bold C2_6 Identity-H
7 1616 ttf Type0 WIQNDS+FrutigerLTCom-LightCnIta C2_7 Identity-H
8 1615 ttf Type0 WIQNDS+FrutigerLTCom-LightCn C2_8 Identity-H
9 1619 ttf Type0 WIQNDS+FrutigerLTCom-BoldCn C2_9 Identity-H
10 1612 cff Type1 JUBNDS+Myriad-Roman T1_0 WinAnsiEncoding
11 1601 ttf TrueType JUBNDS+FrutigerLTCom-Light TT0 WinAnsiEncoding
12 1605 ttf TrueType JUBNDS+FrutigerLTCom-Bold TT1 WinAnsiEncoding
The rawdict will contain font names like FrutigerLTCom-Light
, so you can't map them back to the right font list entry - is it XQATRQ+FrutigerLTCom-Light
or JUBNDS+FrutigerLTCom-Light
? They have different xrefs, so I assume they're different font objects.
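For matching, the 6-letter subset tag (the part before "+") can be stripped from the font list entries before comparison - though, as noted, that still leaves the XQATRQ+/JUBNDS+ ambiguity unresolved. A sketch:

```python
import re

def strip_subset_prefix(basefont):
    """Remove a subset tag like 'XQATRQ+' (6 uppercase letters plus '+') from a BaseFont name."""
    m = re.match(r"^[A-Z]{6}\+(.+)$", basefont)
    return m.group(1) if m else basefont
```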
Again, I think this is mostly academic at this point (going via the OCR route makes more sense in my case), but thought I should mention it.
Some tangential thoughts

After looking at some more PDFs I have to work with, this seems to be a whole class of errors where the text can't be easily extracted from the document. No reader seems to be able to handle these cases; copy pasting text from the pdf results in lots of '?'.
I think one strategy to deal with these cases is:
for each unmappable character in the pdf:
    render the character as an image
    use OCR to identify the character
    map that character to the correct unicode point
It's an expensive operation (and not 100% accurate), but:
Anyway, thanks for the help with this - on my side I'll just take the easy way out and OCR the pages. Adding the suggested API (returning the original pdf codepoints) might be helpful to other people in the future, if you decide to add it.
Hmmm - no success. I'm getting this:
{'origin': (311.81060791015625, 562.3582763671875), 'bbox': (311.81060791015625, 557.4832763671875, 315.7106018066406, 563.9832763671875), 'c': '�', 'code': 65533}
So the value 65533 of ch->c
is determined earlier in MuPDF, and we have no way to see the HOWs and WHYs.
I have marked the text of your example characters: a block with 1 line, line with 1 span of 6 characters. Drawing a rectangle around that line's bbox shows this:
After looking at some more PDFs I have to work with, this seems to be a whole class of errors where the text can't be easily extracted from the document. No reader seems to be able to handle these cases; copy pasting text from the pdf results in lots of '?'.
Maybe this is a way to prevent exactly this copy / paste procedure: only the font file's program knows how to do the translation. I have seen a couple of efforts in that direction.
The rawdict will contain font names like FrutigerLTCom-Light, so you can't map them back into the right fontlist entry - is it XQATRQ+FrutigerLTCom-Light or JUBNDS+FrutigerLTCom-Light? They have different xrefs, so I assume they're different font objects.
Yes, some fonts are containers of subfonts, which can be referenced in that manner. Doing so helps limit the PDF file size - potentially a big space saver. As this part of the code (everything except html, xhtml and xml) is my own making, it would be easy to return the full font name: currently I am stripping off any prefix delimited by "+" in an effort to follow MuPDF's logic here. I am willing to follow a request for the full name instead ... ;-)
Noted - my experience with pymupdf is currently limited, but I'm sure I'll get more intimate with it over time.
Welcome on board anyway, hope you will enjoy the package.
Hey, I am sorry, I deleted my comment here and moved it to discussion
no problem - saw it there.
@victor-ab - This is the approach:
- extract the page's text via page.get_text("rawdict", flags=0)
, which will give a hierarchy of Python dictionaries
- whenever a character (lowest hierarchy level) code is 65533, make a pixmap from its bbox
- use OCR to interpret the pixmap into a unicode and take that value to replace 65533.
mat = fitz.Matrix(2, 2)  # potentially use to magnify / improve char image
for b in page.get_text("rawdict", flags=0)["blocks"]:  # flags value excludes any images on page
    for l in b["lines"]:
        for s in l["spans"]:
            for char in s["chars"]:
                if char["c"] == 65533:
                    pix = page.get_pixmap(matrix=mat, clip=char["bbox"])
                    # call some OCR magic with 'pix' to receive recovered unicode unc
                    char["c"] = unc
The open point is the 'OCR magic'! The rest is more of a no-brainer.
With its v1.18.0, MuPDF has introduced integrated support of Tesseract. I have not yet extended PyMuPDF to support this. Primary reasons:
- other priorities (issue resolutions, ...)
- the effort:
  - extension to PyMuPDF's API: parallel methods for all (some?) text extractions
  - unclear integration with Tesseract, which could (or should?!) already have been installed on the target system - including a range of detectable / supported languages, Tesseract training data and what not. How to locate it / include some configuration features in PyMuPDF, ...
So for the time being, a somewhat clumsy way out may be to check whether a page has at least one character code 65533. If yes, hand the respective page to an outside subprocess, which executes pre-installed Tesseract with it. Then extract the text of the returned OCR-ed page ...
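A sketch of that "clumsy way out", under the assumption that the tesseract binary is on PATH and the page has already been rendered to a PNG (e.g. via a page pixmap written to disk; both function names below are my own):

```python
import os
import subprocess
import tempfile

REPLACEMENT = "\ufffd"  # chr(65533)

def has_replacement(rawdict):
    """True if a page.get_text('rawdict') result contains any replacement character."""
    return any(
        ch["c"] == REPLACEMENT
        for b in rawdict["blocks"]
        for l in b.get("lines", [])   # image blocks have no "lines" key
        for s in l["spans"]
        for ch in s["chars"]
    )

def ocr_png(png_path, lang="eng"):
    """Run a pre-installed Tesseract on a page image and return the recognized text."""
    with tempfile.TemporaryDirectory() as tmp:
        out_base = os.path.join(tmp, "out")
        # Tesseract appends '.txt' to the output base name itself
        subprocess.run(["tesseract", png_path, out_base, "-l", lang], check=True)
        with open(out_base + ".txt", encoding="utf-8") as f:
            return f.read()
```

Only pages for which has_replacement() is True would need the OCR round-trip.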
This method is perfect for my needs, however I cannot save the OCR-ed character. The dictionary doesn't update.
print(page_test.get_text("rawdict", flags=0)["blocks"][0]['lines'][0]['spans'][0]['chars'][1]['c'])
'र'
page_test.get_text("rawdict", flags=0)["blocks"][0]['lines'][0]['spans'][0]['chars'][1]['c'] = 'A'
print(page_test.get_text("rawdict", flags=0)["blocks"][0]['lines'][0]['spans'][0]['chars'][1]['c'])
र
@smokersan1 - a joke?
I am having issues extracting text from the following pdf: https://gofile.io/?c=TN6hln Apologies for the large pdf, I didn't want to modify the document and cut it down in case it messed up some metadata.
Anyway, on pdf page 136 the text is extracted as '�' (it looks rendered OK).
All of those are "Replacement Characters" with code 65533.
From digging around, as far as I can tell, this is because of the Identity-H mapping in the PDF FontList:
From what I understand there are 2 possibilities here:
My questions would be:
Related discussion that ended with "won't fix": https://github.com/pymupdf/PyMuPDF/issues/87