smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Missing PDFDocEncoding #609

Closed by GreyWyvern 1 year ago

GreyWyvern commented 1 year ago

The Adobe PDF Reference defines a special encoding, an extension of Latin-1, which it describes as follows:

"Informational or content strings can be represented in Unicode. These strings include text annotations, bookmark names, article names, document information, date strings, etc. In PDF 1.1 these strings are stored in PDFDocEncoding, which is a superset of ISOLatin1."

Source: PDF Reference 1.2 - https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.2.pdf

See also: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf

As this includes "document information" such as titles, authors and other details, PdfParser should use PDFDocEncoding to translate these strings.
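
As a rough illustration of what that decoding involves, here is a minimal sketch, not PdfParser code. It assumes the rule from the PDF Reference 1.7 that a text string is UTF-16BE when it starts with the byte-order mark FE FF and PDFDocEncoding otherwise. The function name `decodePdfTextString()` is hypothetical, and the sketch needs the mbstring extension.

```php
<?php

// Sketch only: decodePdfTextString() is a hypothetical helper, not part of PdfParser.
// The code points follow the PDFDocEncoding table in the PDF Reference 1.7.
function decodePdfTextString(string $bytes): string
{
    // A leading FE FF byte-order mark means UTF-16BE rather than PDFDocEncoding.
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16BE');
    }

    // Byte values where PDFDocEncoding differs from ISO Latin-1 (byte => Unicode code point).
    static $special = [
        0x18 => 0x02D8, 0x19 => 0x02C7, 0x1A => 0x02C6, 0x1B => 0x02D9,
        0x1C => 0x02DD, 0x1D => 0x02DB, 0x1E => 0x02DA, 0x1F => 0x02DC,
        0x80 => 0x2022, 0x81 => 0x2020, 0x82 => 0x2021, 0x83 => 0x2026,
        0x84 => 0x2014, 0x85 => 0x2013, 0x86 => 0x0192, 0x87 => 0x2044,
        0x88 => 0x2039, 0x89 => 0x203A, 0x8A => 0x2212, 0x8B => 0x2030,
        0x8C => 0x201E, 0x8D => 0x201C, 0x8E => 0x201D, 0x8F => 0x2018,
        0x90 => 0x2019, 0x91 => 0x201A, 0x92 => 0x2122, 0x93 => 0xFB01,
        0x94 => 0xFB02, 0x95 => 0x0141, 0x96 => 0x0152, 0x97 => 0x0160,
        0x98 => 0x0178, 0x99 => 0x017D, 0x9A => 0x0131, 0x9B => 0x0142,
        0x9C => 0x0153, 0x9D => 0x0161, 0x9E => 0x017E, 0xA0 => 0x20AC,
    ];

    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $byte = ord($bytes[$i]);
        // Assumption: bytes not listed above (including the undefined codes
        // 0x7F, 0x9F and 0xAD) are passed through as their Latin-1 code points.
        $out .= mb_chr($special[$byte] ?? $byte, 'UTF-8');
    }

    return $out;
}
```

With this sketch, a Title value stored as the bytes `Caf\xE9 \x80 \xA0` would come out as "Café", a bullet and a euro sign in UTF-8, whereas treating the same bytes as plain Latin-1 would turn 0x80 and 0xA0 into a control character and a non-breaking space.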

Here is a proposed 'PDFDocEncoding.php' file I quickly mocked up but haven't tested yet. You can give it a shot; I will also see if I can create a branch where this works and submit a PR.

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <sebastien@malot.fr>
 *
 * @date    2017-01-03
 *
 * @license LGPLv3
 *
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <sebastien@malot.fr>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

// Source : https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.2.pdf
// Source : https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf

namespace Smalot\PdfParser\Encoding;

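/**
 * Class PDFDocEncoding
 *
 * Glyph names for the 256 byte values of PDFDocEncoding, in character code
 * order. Entries without a glyph name are given as '.notdef'.
 */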
class PDFDocEncoding extends AbstractEncoding
{
    public function getTranslations(): array
    {
        $encoding =
          '.notdef .notdef .notdef .notdef .notdef .notdef .notdef .notdef '.
          '.notdef .notdef .notdef .notdef .notdef .notdef .notdef .notdef '.
          '.notdef .notdef .notdef .notdef .notdef .notdef .notdef .notdef '.
          'breve caron circumflex dotaccent hungarumlaut ogonek ring tilde '.
          'space exclam quotedbl numbersign dollar percent ampersand quotesingle '.
          'parenleft parenright asterisk plus comma hyphen period slash zero one '.
          'two three four five six seven eight nine colon semicolon less equal '.
          'greater question at A B C D E F G H I J K L M N O P Q R S T U V W X '.
          'Y Z bracketleft backslash bracketright asciicircum underscore '.
          'grave a b c d e f g h i j k l m n o p q r s t u v w x y z '.
          'braceleft bar braceright asciitilde .notdef bullet dagger daggerdbl '.
          'ellipsis emdash endash florin fraction guilsinglleft guilsinglright '.
          'minus perthousand quotedblbase quotedblleft quotedblright quoteleft '.
          'quoteright quotesinglbase trademark fi fl Lslash OE Scaron Ydieresis '.
          'Zcaron dotlessi lslash oe scaron zcaron .notdef Euro exclamdown cent '.
          'sterling currency yen brokenbar section dieresis copyright '.
          'ordfeminine guillemotleft logicalnot .notdef registered macron degree '.
          'plusminus twosuperior threesuperior acute mu paragraph '.
          'periodcentered cedilla onesuperior ordmasculine guillemotright '.
          'onequarter onehalf threequarters questiondown Agrave Aacute '.
          'Acircumflex Atilde Adieresis Aring AE Ccedilla Egrave Eacute '.
          'Ecircumflex Edieresis Igrave Iacute Icircumflex Idieresis Eth Ntilde '.
          'Ograve Oacute Ocircumflex Otilde Odieresis multiply Oslash Ugrave '.
          'Uacute Ucircumflex Udieresis Yacute Thorn germandbls agrave aacute '.
          'acircumflex atilde adieresis aring ae ccedilla egrave eacute '.
          'ecircumflex edieresis igrave iacute icircumflex idieresis eth ntilde '.
          'ograve oacute ocircumflex otilde odieresis divide oslash ugrave '.
          'uacute ucircumflex udieresis yacute thorn ydieresis';

        return explode(' ', $encoding);
    }
}

k00ni commented 1 year ago

This was fixed by #611, wasn't it, @GreyWyvern?

If not, please reopen.