jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars 659 forks source link

Explicitly typecast `fontname` and `text` fields to str for char objects #462

Closed samkit-jain closed 1 year ago

samkit-jain commented 3 years ago

This PR fixes #461. The issue was happening because, in the PDF, there was a char object as

{
    "fontname": b"\xcb\xce\xcc\xe5",
    "adv": Decimal("5.000"),
    "upright": True,
    "x0": Decimal("-146.900"),
    "y0": Decimal("782.030"),
    "x1": Decimal("-141.900"),
    "y1": Decimal("792.030"),
    "width": Decimal("5.000"),
    "height": Decimal("10.000"),
    "size": Decimal("10.000"),
    "object_type": "char",
    "page_number": 1,
    "stroking_color": [1, 1, 1],
    "non_stroking_color": [0, 0, 0],
    "text": "B",
    "top": Decimal("49.970"),
    "bottom": Decimal("59.970"),
    "doctop": Decimal("49.970"),
}

and since the fontname was of type bytes and not str, the sorted method threw an error as not all the fontnames were having the same type.

As a fix, I made it such that when fetching the chars, we do an explicit type cast to a string of fields text and fontname.

@jsvine My concern with this is that pdfplumber will be altering the fontname as provided in the PDF in which case, a better solution might be to only to do the typecast when calling the sorted(...) method in yield_unique_chars(...). What do you think?

codecov[bot] commented 3 years ago

Codecov Report

Merging #462 (5311913) into develop (002500a) will increase coverage by 0.00%. The diff coverage is 100.00%.

:exclamation: Current head 5311913 differs from pull request most recent head ca8bfbb. Consider uploading reports for the commit ca8bfbb to get more accurate results Impacted file tree graph

@@           Coverage Diff            @@
##           develop     #462   +/-   ##
========================================
  Coverage    98.28%   98.29%           
========================================
  Files           10       10           
  Lines         1227     1232    +5     
========================================
+ Hits          1206     1211    +5     
  Misses          21       21           
Impacted Files Coverage Δ
pdfplumber/container.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 002500a...ca8bfbb. Read the comment docs.

jsvine commented 3 years ago

Thanks, @samkit-jain! A very interesting issue here. I've spent some time looking at it, and what follows is my understanding.

As you note, it's just one character that's causing problems, and the problem is due to the fontname property being represented as bytes rather than a string. Here's why I think that's happening:

<</Type/FontDescriptor/FontName/BABGUC+#cb#ce#cc#e5/FontBBox[-7 -140 1000 859]/Flags 65568
/Ascent 859
/CapHeight 859
/Descent -140
/ItalicAngle 0
/StemV 150
/CIDSet 65 0 R
/FontFile2 53 0 R>>

From what I can tell online, however, PostScript FontNames are supposed to be strings and not contain #-hex-hex encodings, though I'm not 100% sure.

Since this is just one character in one PDF, and we haven't encountered this issue before (despite many people using pdfplumber to parse documents with non-Latin alphabets), I'd lean toward not adding code to handle it. Instead, the original issue might be addressed by adding these two lines inside the for page loop:

for char in page.objects["char"]:
    char["fontname"] = str(char["fontname"])

What do you think?

samkit-jain commented 3 years ago

Hi @jsvine This is very insightful. Yes, it could be that the font name was in a non-UTF-8 format, say UTF-16. I also tried repairing the PDF using Ghostscript but it didn't have any effect. I agree that it is a rare case but I am not sure if not adding a fix in the library is the right way to go. But, yes, can wait and see if more people report a similar issue.

jsvine commented 3 years ago

Sounds good — let's wait to see whether this may be a more widespread issue with some PDFs.

ayhamkan commented 1 year ago

Thanks, @samkit-jain! A very interesting issue here. I've spent some time looking at it, and what follows is my understanding.

As you note, it's just one character that's causing problems, and the problem is due to the fontname property being represented as bytes rather than a string. Here's why I think that's happening:

  • In the raw PDF, the FontDescriptor is defined as such:
<</Type/FontDescriptor/FontName/BABGUC+#cb#ce#cc#e5/FontBBox[-7 -140 1000 859]/Flags 65568
/Ascent 859
/CapHeight 859
/Descent -140
/ItalicAngle 0
/StemV 150
/CIDSet 65 0 R
/FontFile2 53 0 R>>
  • ... so the /FontName entry in that descriptor dictionary is the PostScript literal /BABGUC+#cb#ce#cc#e5
  • When pdfminer.six (and pdfminer before it) encounters a # inside a PostScript literal, it interprets the two subsequent characters as a hexadecimal-encoded byte. From what I understand, it also tries interpreting all PostScript literals as utf-8 strings, but allows un-decodable strings to pass through as bytes objects. So I believe that's why /BABGUC+#cb#ce#cc#e5 is coming through as b"\xcb\xce\xcc\xe5". (Maybe there's a logic flaw that leads to the BABGUC+ part being missing?)

From what I can tell online, however, PostScript FontNames are supposed to be strings and not contain #-hex-hex encodings, though I'm not 100% sure.

Since this is just one character in one PDF, and we haven't encountered this issue before (despite many people using pdfplumber to parse documents with non-Latin alphabets), I'd lean toward not adding code to handle it. Instead, the original issue might be addressed by adding these two lines inside the for page loop:

for char in page.objects["char"]:
    char["fontname"] = str(char["fontname"])

What do you think?

still doesn't work all the time

jsvine commented 1 year ago

Ended up implementing a similar, but slightly different version of this, based on some research about the common bytes-typed fontnames we saw cropping up in issues/discussions: https://github.com/jsvine/pdfplumber/pull/862/commits/9441ff7628fff9f69d81c6afd8ef439bf101b254

Thank you for getting the ball rolling on this, @samkit-jain!