smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Baseencoding fallback #669

Closed GreyWyvern closed 5 months ago

GreyWyvern commented 5 months ago

When a document doesn't include a BaseEncoding entry, StandardEncoding should be assumed as the default instead of an empty string.

About

Some short-and-sweet documents may not include a BaseEncoding entry. For this case, the PDF Reference 1.7 describes StandardEncoding as the default.

Chapter 5, page 426:

Latin-text font programs produced by Adobe Systems use the Adobe standard encoding, often referred to as StandardEncoding. The name StandardEncoding has no special meaning in PDF, but this encoding does play a role as a default encoding.

Section 5.5, page 431:

  • If the Encoding entry is a dictionary, the table is initialized with the entries from the dictionary's BaseEncoding entry (see Table 5.11). Any entries in the Differences array are used to update the table. Finally, any undefined entries in the table are filled using StandardEncoding.
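The three steps in that passage can be sketched roughly as follows. This is a minimal illustration, not code from this PR or from the library: `buildEncodingTable()` is a hypothetical helper, the encoding tables are plain `code => glyph name` arrays, and the StandardEncoding table is only a tiny excerpt.

```php
<?php

/**
 * Hypothetical helper, not part of smalot/pdfparser.
 *
 * @param array<int, string>     $baseTable     code => glyph name from BaseEncoding (may be empty)
 * @param array<int, int|string> $differences   flat Differences array: a code followed by one or more glyph names
 * @param array<int, string>     $standardTable code => glyph name for StandardEncoding
 *
 * @return array<int, string>
 */
function buildEncodingTable(array $baseTable, array $differences, array $standardTable): array
{
    // 1. Initialize the table with the BaseEncoding entries.
    $table = $baseTable;

    // 2. Apply Differences: an integer sets the current code; each name that
    //    follows is assigned to that code, and the code then advances by one.
    $code = 0;
    foreach ($differences as $item) {
        if (is_int($item)) {
            $code = $item;
            continue;
        }
        $table[$code] = $item;
        ++$code;
    }

    // 3. Fill any still-undefined codes from StandardEncoding.
    //    The += operator keeps keys that already exist on the left side.
    $table += $standardTable;
    ksort($table);

    return $table;
}

// Usage with a three-entry excerpt of StandardEncoding (assumed sample data).
$standardExcerpt = [65 => 'A', 66 => 'B', 96 => 'quoteleft'];
$table = buildEncodingTable([], [96, 'grave', 'a'], $standardExcerpt);
```

Here code 96 keeps the `grave` override from Differences, while 65 and 66 are filled in from the StandardEncoding excerpt.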

If the BaseEncoding lookup returns an empty string, use StandardEncoding as the value instead. Resolves #665.
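As a rough sketch of that fallback (not the actual diff): `resolveBaseEncoding()` is a hypothetical helper, and the plain array stands in for a parsed Encoding dictionary rather than the library's real object model.

```php
<?php

/**
 * Hypothetical helper, not part of smalot/pdfparser.
 *
 * @param array<string, mixed> $encodingDict stand-in for a parsed Encoding dictionary
 */
function resolveBaseEncoding(array $encodingDict): string
{
    // A missing entry is treated the same as an empty one.
    $base = (string) ($encodingDict['BaseEncoding'] ?? '');

    // Fall back to StandardEncoding instead of returning an empty string.
    return '' === $base ? 'StandardEncoding' : $base;
}
```

An explicit BaseEncoding such as `WinAnsiEncoding` passes through unchanged; only the missing or empty case is replaced.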

GreyWyvern commented 5 months ago

PHP CS Fixer is complaining about indentation in Document.php, PDFObject.php and RawData\RawDataParser.php. Files I didn't even modify. :( Running PHP CS Fixer on my local (Windows) machine doesn't find these issues either.

k00ni commented 5 months ago

I merged #670 into master, which fixes these coding style issues. Please merge master into your branch to get rid of them.

k00ni commented 5 months ago

Thank you!