smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0
2.3k stars 534 forks source link

ErrorException Undefined array key 38 #647

Open tau-zero opened 8 months ago

tau-zero commented 8 months ago

Description:

I'm getting an error when trying to run getDataTm() on a pdf. I have tried this on a native LAMP stack and inside a Symfony 6.3.6 framework. ErrorException - Undefined array key 38

PDF input

420399_20231024.pdf

Expected output & actual output

Code

    $parser = new Parser();
    $pdf = $parser->parseFile($uploadedFile);
    $totalPages = $pdf->getDetails()['Pages'];
    for ($p = 0; $p < $totalPages; $p++) {
          **$data[$p] = $pdf->getPages()[$p]->getDataTm();   //  <---- error happen here**
    }

I temporary solved wrapping the read of array "$extractedTexts" in a "array_key_exists()" checking and all pdf texts seems correctly retrivied except for last text in the bottom right:"Page 1 of n" that is lost.

In "vendor/smalot/pdfparser/src/Smalot/PdfParser/Page.php":

Before fix:

704 $currentText = $extractedTexts[\count($extractedData)];
705         switch ($command['o']) {

After fix:

704     if (array_key_exists(\count($extractedData), $extractedTexts)) {
705         $currentText = $extractedTexts[\count($extractedData)];
706         switch ($command['o']) {
              . . .
878         }
879 }
tau-zero commented 8 months ago

It appears that extra commands exist after all tests have been processed. In my case the extra command gets the coordinates for an element very far down the page (not visible in the original PDF and not listed in the "$extractedTexts" array), probably offpage and which doesn't even need to be inserted into the "$this-> dataTm". In practice it is a command that does nothing.

The problem seems to be definitively solved (successfully, no text lost) by simply removing the "$currentText" variable assignment (line 704) placed before the "switch ($command['o'])" statement and inserting it instead at each point of the vary cases of the switch only where necessary (four occurrences: lines 800, 818, 833 and 866)

I hope these changes can be included in the next version of the package.

k00ni commented 8 months ago

We might have a fix already, please check out #639.

Marco-Riccardi-HTA commented 8 months ago

k00ni: I took a look at the changes introduced but I think it's just a workaround: the change effectively stops the parsing of the commands at the end of the text array, not executing further ones (commands). I think that the cleanest solution is to change the use of the variable $extractedTexts[\count($extractedData)] directly in the lines involved in that content and let the complete parsing of the commands take place. The problem arises from the fact that the variable is read before the "switch/case" statement (where the variable could be inconsistent), while instead it should only be read within the relative "cases" where consistency is evident. So, in fact:

703     foreach ($dataCommands as $command) {
704-        $currentText = $extractedTexts[\count($extractedData)];
705         switch ($command['o']) {

801             case 'Tj':
802-                $data = [$Tm, $currentText];
802+                $data = [$Tm, $extractedTexts[\count($extractedData)]];
803                 if ($this->config->getDataTmFontInfoHasToBeIncluded()) {

817             case "'":
818                 $Ty -= $Tl;
819                 $Tm[$y] = (string) $Ty;
820-                $extractedData[] = [$Tm, $currentText];
820+                $extractedData[] = [$Tm, $extractedTexts[\count($extractedData)]];
821                 break;

834             case '"':
835-                $data = explode(' ', $currentText);
835+                $data = explode(' ', $extractedTexts[\count($extractedData)]);
836                 $Ty -= $Tl;

867             case 'TJ':
868-                $data = [$Tm, $currentText];
868+                $data = [$Tm, $extractedTexts[\count($extractedData)]];
869                 if ($this->config->getDataTmFontInfoHasToBeIncluded()) {

These code changes should permanently resolve the issue by allowing full parsing to occur.

k00ni commented 8 months ago

@Marco-Riccardi-HTA thank you for your feedback. Would you mind creating a pull request with these changes?

tau-zero commented 8 months ago

@k00ni: I may give a tray (I never contribute directly so I must learn how to interact with Github)