smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Undefined variable in ->getObjectsByType('EmbeddedFile') although the PDF contains an attachment #740

Open PrimeGhostDE opened 1 month ago

PrimeGhostDE commented 1 month ago

Description:

From what I have analyzed so far, the problem concerns embedded files.

PDF input

Example PDF invoice: Rechnung RE-202400282 vom 09.10.2024 zu Ihr Zeichen.pdf

Expected output & actual output

The PDF contains only one embedded file (please ignore the fact that I use foreach here):

$pdfParser = new PdfParser();
$pdfParsed = $pdfParser->parseContent($pdfContent);
$filespecs = $pdfParsed->getObjectsByType('Filespec');

foreach ($filespecs as $filespec) {
  $filespecDetails = $filespec->getDetails();
  /* Output:
  array:7 [
    "AFRelationship" => "Alternative"
    "Desc" => "ZUGFeRD 2.1 Rechnung"
    "EF" => array:1 [
      "F" => array:3 [
        "DL" => "22974"
        "Length" => "1871"
        "Subtype" => "text/xml"
      ]
    ]
    "F" => "factur-x.xml"
    "Subtype" => "text/xml"
    "Type" => "Filespec"
    "UF" => "factur-x.xml"
  ]
  */
}

$pdfParsed->getObjectsByType('EmbeddedFile'); is not empty, so let's get the first embedded file:
foreach ($embeddedFiles as $embeddedFile) {
  $embeddedFile->getContent(); // returns Undefined variable $embeddedFile
}

Expected output: the variable $embeddedFile should be defined.

Code

see above :)

Kind regards

PrimeGhostDE commented 3 days ago

Hi all, is there any news on this?

k00ni commented 3 days ago

In your code the variable $embeddedFiles is never defined:

$pdfParsed->getObjectsByType('EmbeddedFile');
foreach ($embeddedFiles as $embeddedFile) {
  $embeddedFile->getContent(); // returns Undefined variable $embeddedFile
}

It should complain that $embeddedFiles is not defined. What happens when you do $embeddedFiles = $pdfParsed->getObjectsByType('EmbeddedFile') before the loop?

PrimeGhostDE commented 2 days ago

You are right, I copied it to GitHub without the assignment. The result stays the same, though: it does not work as expected.

$pdfParser = new PdfParser();
$pdfContent = file_get_contents(Storage::path($dokument->path . '/' . $dokument->filename));
$pdfParsed = $pdfParser->parseContent($pdfContent);
$embeddedFiles = $pdfParsed->getObjectsByType('EmbeddedFile');
dd($embeddedFiles); // dumps an array without any elements

Execution never enters the foreach loop.