Closed by samkit-jain 1 year ago
Merging #462 (5311913) into develop (002500a) will increase coverage by 0.00%. The diff coverage is 100.00%.

:exclamation: Current head 5311913 differs from pull request most recent head ca8bfbb. Consider uploading reports for the commit ca8bfbb to get more accurate results.

```diff
@@           Coverage Diff            @@
##           develop     #462  +/-  ##
========================================
  Coverage    98.28%   98.29%
========================================
  Files           10       10
  Lines         1227     1232      +5
========================================
+ Hits          1206     1211      +5
  Misses          21       21
```

| Impacted Files | Coverage Δ |
|---|---|
| pdfplumber/container.py | 100.00% <100.00%> (ø) |

Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 002500a...ca8bfbb.
Thanks, @samkit-jain! A very interesting issue here. I've spent some time looking at it, and what follows is my understanding.

As you note, it's just one character that's causing problems, and the problem is due to the `fontname` property being represented as bytes rather than a string. Here's why I think that's happening:

- In the raw PDF, the `FontDescriptor` is defined as such:

  ```
  <</Type/FontDescriptor/FontName/BABGUC+#cb#ce#cc#e5/FontBBox[-7 -140 1000 859]/Flags 65568 /Ascent 859 /CapHeight 859 /Descent -140 /ItalicAngle 0 /StemV 150 /CIDSet 65 0 R /FontFile2 53 0 R>>
  ```

- ... so the `/FontName` entry in that descriptor dictionary is the PostScript literal `/BABGUC+#cb#ce#cc#e5`.
- When `pdfminer.six` (and `pdfminer` before it) encounters a `#` inside a PostScript literal, it interprets the two subsequent characters as a hexadecimal-encoded byte. From what I understand, it also tries interpreting all PostScript literals as `utf-8` strings, but allows un-decodable strings to pass through as `bytes` objects. So I believe that's why `/BABGUC+#cb#ce#cc#e5` is coming through as `b"\xcb\xce\xcc\xe5"`. (Maybe there's a logic flaw that leads to the `BABGUC+` part being missing?)
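The escape-and-decode behavior described above can be sketched in Python. This is an illustration of the described behavior, not `pdfminer.six`'s actual code; `decode_ps_literal` is a hypothetical helper name:

```python
import re

def decode_ps_literal(name: str):
    """Expand '#xx' hex escapes in a PostScript literal name, then try
    to decode the result as UTF-8, falling back to raw bytes. This is
    a sketch of the behavior described above, not pdfminer.six's code."""
    # Replace each '#xx' escape with the byte it encodes.
    raw = re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        name.encode("ascii"),
    )
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Un-decodable names pass through as bytes objects.
        return raw

decode_ps_literal("BABGUC")        # -> 'BABGUC' (str)
decode_ps_literal("#cb#ce#cc#e5")  # -> b'\xcb\xce\xcc\xe5' (bytes)
```

Note that under this sketch, `BABGUC+#cb#ce#cc#e5` would decode to `b"BABGUC+\xcb\xce\xcc\xe5"` with the prefix preserved, which is why the missing `BABGUC+` prefix looks like it may be a separate logic flaw.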
From what I can tell online, however, PostScript `FontName`s are supposed to be strings and not contain `#-hex-hex` encodings, though I'm not 100% sure.

Since this is just one character in one PDF, and we haven't encountered this issue before (despite many people using `pdfplumber` to parse documents with non-Latin alphabets), I'd lean toward not adding code to handle it. Instead, the original issue might be addressed by adding these two lines inside the `for page` loop:

```python
for char in page.objects["char"]:
    char["fontname"] = str(char["fontname"])
```

What do you think?
Hi @jsvine This is very insightful. Yes, it could be that the font name was in a non-UTF-8 format, say UTF-16. I also tried repairing the PDF using Ghostscript, but it had no effect. I agree that it is a rare case, but I am not sure that leaving a fix out of the library is the right way to go. That said, we can wait and see whether more people report a similar issue.
Sounds good — let's wait to see whether this may be a more widespread issue with some PDFs.
still doesn't work all the time
Ended up implementing a similar, but slightly different version of this, based on some research about the common bytes-typed fontnames we saw cropping up in issues/discussions: https://github.com/jsvine/pdfplumber/pull/862/commits/9441ff7628fff9f69d81c6afd8ef439bf101b254
Thank you for getting the ball rolling on this, @samkit-jain!
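One generic approach along those lines can be sketched as follows. The helper name `normalize_fontname` and the candidate-encoding list are assumptions for illustration; the actual linked commit may differ:

```python
def normalize_fontname(fontname):
    """Sketch of one way to handle bytes-typed fontnames: try a few
    encodings commonly seen in CJK PDFs, then fall back to a decoding
    that cannot fail. Illustrative only; not the actual pdfplumber fix."""
    if isinstance(fontname, str):
        return fontname
    for encoding in ("utf-8", "gbk", "big5"):  # assumed candidate encodings
        try:
            return fontname.decode(encoding)
        except UnicodeDecodeError:
            pass
    # latin-1 maps every byte to a codepoint, so this never raises.
    return fontname.decode("latin-1")

normalize_fontname(b"\xcb\xce\xcc\xe5")  # b"\xcb\xce\xcc\xe5" is GBK for "宋体" (SimSun)
```

The advantage over a blanket `str(...)` cast is that a GBK-encoded name like the one in this issue comes back as readable text rather than as the literal string `"b'\\xcb\\xce\\xcc\\xe5'"`.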
This PR fixes #461. The issue was happening because, in the PDF, there was a char object whose `fontname` was of type `bytes` and not `str`, so the `sorted` method threw an error because not all the `fontname`s had the same type.

As a fix, I made it so that when fetching the `chars`, we do an explicit type cast to a string of the fields `text` and `fontname`.

@jsvine My concern with this is that `pdfplumber` will be altering the `fontname` as provided in the PDF, in which case a better solution might be to only do the typecast when calling the `sorted(...)` method in `yield_unique_chars(...)`. What do you think?
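To make the failure and the narrower-fix idea concrete, here is a minimal standalone sketch (it does not use pdfplumber's internals; the font names are made up for illustration):

```python
# Python refuses to order str against bytes, which is the root error
# behind the sorted(...) call failing on mixed-type fontnames.
chars = [
    {"fontname": "Helvetica"},          # normal, str-typed
    {"fontname": b"\xcb\xce\xcc\xe5"},  # problematic, bytes-typed
]

try:
    sorted(c["fontname"] for c in chars)
except TypeError as e:
    print(e)  # '<' not supported between instances of 'bytes' and 'str'

# The narrower fix suggested above: cast to str only at the point of
# sorting, leaving the fontname stored on each char untouched.
unique = sorted({str(c["fontname"]) for c in chars})
```

Casting only inside the sort keeps the comparison well-defined while preserving the original `bytes` value for anyone who wants to decode it themselves.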