smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

TCPDF_PARSER ERROR: Invalid object reference: Array #86

Open SNTRM-G opened 9 years ago

SNTRM-G commented 9 years ago

Context: smalot/pdfparser v0.9.25, tecnickcom/tcpdf 6.2.12, PHP 5.4.6.0; Windows 10 (x64).

Hi,

We are unable to process any pdf file. We tested with the demo page (http://www.pdfparser.org/demo) and it parses OK. But when we try

    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($pdf_file_path);

it always fails with the following error:

    PHP Fatal error: Uncaught exception 'Exception' with message 'TCPDF_PARSER ERROR: Invalid object reference: Array' in T:\Projecto\PHP\PDF_parsing\PDFParser\vendor\tecnickcom\tcpdf\tcpdf_parser.php:807
    Stack trace:
    #0 T:\Projecto\PHP\PDF_parsing\PDFParser\vendor\tecnickcom\tcpdf\tcpdf_parser.php(680): TCPDF_PARSER->Error('Invalid object ...')
    #1 T:\Projecto\PHP\PDF_parsing\PDFParser\vendor\tecnickcom\tcpdf\tcpdf_parser.php(286): TCPDF_PARSER->getIndirectObject('', '173', true)
    #2 T:\Projecto\PHP\PDF_parsing\PDFParser\vendor\tecnickcom\tcpdf\tcpdf_parser.php(195): TCPDF_PARSER->decodeXrefStream('173', Array)
    #3 T:\Projecto\PHP\PDF_parsing\PDFParser\vendor\tecnickcom\tcpdf\tcpdf_parser.php(117): TCPDF_PARSER->getXrefData()
    #4 T:\Projecto\PHP\PDF_parsing\PDFParser\vendor\smalot\pdfparser\src\Smalot\PdfParser\Parser.php(88): TCPDF_PARSER->__construct('%PDF-1.3?%?????...')
    #5 T:\Projecto\PHP\PDF_parsing\PDFParser\vendor\smalot\pdfparser\src\Smalot\PdfParser\Parser.php(74): Smalot\PdfParser\Parser->parseContent('%PDF-1.3?%?????...') in T:\Projecto\PHP\PDF_parsing\PDFParser\vendor\tecnickcom\tcpdf\tcpdf_parser.php on line 807

Could you help? Thanks.

nkbaba commented 8 years ago

I found out the issue is with the PDF version. PDF 1.7 won't parse with this version in most cases.

In my case, I converted the version down to 1.4 and then extracted the text.

Refer: https://github.com/xthiago/pdf-version-converter

I would rather not depend on this, though. Is it possible to extract text directly from a PDF 1.7 file? Any help is appreciated.
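The version discussed here is the one declared in the file header (`%PDF-1.x`). A minimal sketch for reading it without extra dependencies, assuming PHP 7.1+; `pdfHeaderVersion` is a hypothetical helper name and not part of pdfparser:

```php
<?php
// Hypothetical helper (not part of smalot/pdfparser): returns the version
// declared in the PDF header, e.g. "1.7", or null when no header is found
// within the first 1024 bytes.
function pdfHeaderVersion(string $bytes): ?string
{
    $head = substr($bytes, 0, 1024);
    if (preg_match('/%PDF-(\d\.\d)/', $head, $m) === 1) {
        return $m[1];
    }
    return null;
}

// Usage (the path is a placeholder):
// $version = pdfHeaderVersion(file_get_contents('/path/to/file.pdf'));
```

Keep in mind that the header is only a declaration. A file can use newer features than its header announces, so this value is a hint rather than a guarantee.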

zirikatzaile commented 7 years ago

Hi ! Just 1 year late :/

I face this same problem but when I check the PDF document version via

    use \Xthiago\PDFVersionConverter\Guesser\RegexGuesser;
    $guesser = new RegexGuesser();
    $version = $guesser->guess($abspath);

I get $version = 1.3, so I first thought it was not the same problem. Anyway, I did the conversion and it actually works! So thanks, alonemayank. The code is a bit ugly, but at least it works:

    use Symfony\Component\Filesystem\Filesystem;
    use Xthiago\PDFVersionConverter\Converter\GhostscriptConverter;
    use Xthiago\PDFVersionConverter\Converter\GhostscriptConverterCommand;
    use Xthiago\PDFVersionConverter\Guesser\RegexGuesser;

    public static function readPDF($abspath, $nonl = false) {
        $guesser = new RegexGuesser();
        $version = $guesser->guess($abspath);
        Log::debug("Documentor->readPDF about to parse $abspath version $version");

        $parser = new \Smalot\PdfParser\Parser();
        try {
            $pdf = $parser->parseFile($abspath);
        } catch (\Exception $e) {
            // Tricky, but it works. Only convert files that fail with this
            // specific error; converting others breaks text extraction.
            if (preg_match('/TCPDF_PARSER ERROR: Invalid object reference:/i', $e->getMessage())) {
                $command = new GhostscriptConverterCommand();
                $filesystem = new Filesystem();
                $converter = new GhostscriptConverter($command, $filesystem);
                // Re-save the file at $abspath as PDF 1.4 via Ghostscript.
                $converter->convert($abspath, '1.4');
                $version = $guesser->guess($abspath);
                Log::debug("Documentor->readPDF must CONVERT $abspath TO version $version");

                // Retry on the converted file.
                $pdf = $parser->parseFile($abspath);
            } else {
                throw $e;
            }
        }
        // ... (remainder of the method not shown)
    }

garbinmarcelo commented 5 years ago

Hello, any news?

LuizMoratelli commented 5 years ago

I tried 6 different files with different PDF versions:

  1. v.1.5 - Works fine
  2. v.1.3 - Exception: TCPDF_PARSER ERROR: Invalid object reference: Array
  3. v.1.7 - Exception: Missing catalog.
  4. v.1.4 - Works fine
  5. v.1.4 - Works fine
  6. v.1.3 - Works fine

I believe the files themselves are badly formed, but the exception messages don't tell me what I need to change.
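One detail in the original stack trace may explain why the header version is a poor predictor: the failing file declares `%PDF-1.3`, yet the parser ends up in `decodeXrefStream`, and cross-reference streams were only introduced in PDF 1.5. A rough heuristic, assuming PHP 7+ and that the stream dictionary appears as plain text in the file, is to look for a `/Type /XRef` entry. `looksLikeXrefStreamPdf` is a hypothetical name, and the check can give false positives:

```php
<?php
// Heuristic sketch (hypothetical helper): true if the raw bytes contain a
// cross-reference stream dictionary entry (/Type /XRef), a PDF 1.5 feature.
// This only scans for the marker; it does not parse the PDF structure.
function looksLikeXrefStreamPdf(string $bytes): bool
{
    return preg_match('~/Type\s*/XRef\b~', $bytes) === 1;
}

// Usage (the path is a placeholder):
// $suspect = looksLikeXrefStreamPdf(file_get_contents('/path/to/file.pdf'));
```

If this returns true for a file whose header claims 1.3 or 1.4, the header and the actual structure disagree, which would be consistent with the mixed results listed above.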
