Open tau-zero opened 8 months ago
It appears that extra commands exist after all tests have been processed. In my case the extra command gets the coordinates for an element very far down the page (not visible in the original PDF and not listed in the "$extractedTexts
" array), probably offpage and which doesn't even need to be inserted into the "$this-> dataTm
". In practice it is a command that does nothing.
The problem seems to be definitively solved (successfully, no text lost) by simply removing the "$currentText
" variable assignment (line 704) placed before the "switch ($command['o'])
" statement and inserting it instead at each point of the vary cases of the switch only where necessary (four occurrences: lines 800, 818, 833 and 866)
I hope these changes can be included in the next version of the package.
We might have a fix already, please check out #639.
k00ni: I took a look at the changes introduced but I think it's just a workaround: the change effectively stops the parsing of the commands at the end of the text array, not executing further ones (commands).
I think that the cleanest solution is to change the use of the variable $extractedTexts[\count($extractedData)]
directly in the lines involved in that content and let the complete parsing of the commands take place.
The problem arises from the fact that the variable is read before the "switch/case" statement (where the variable could be inconsistent), while instead it should only be read within the relative "cases" where consistency is evident.
So, in fact:
703 foreach ($dataCommands as $command) {
704- $currentText = $extractedTexts[\count($extractedData)];
705 switch ($command['o']) {
801 case 'Tj':
802- $data = [$Tm, $currentText];
802+ $data = [$Tm, $extractedTexts[\count($extractedData)]];
803 if ($this->config->getDataTmFontInfoHasToBeIncluded()) {
817 case "'":
818 $Ty -= $Tl;
819 $Tm[$y] = (string) $Ty;
820- $extractedData[] = [$Tm, $currentText];
820+ $extractedData[] = [$Tm, $extractedTexts[\count($extractedData)]];
821 break;
834 case '"':
835- $data = explode(' ', $currentText);
835+ $data = explode(' ', $extractedTexts[\count($extractedData)]);
836 $Ty -= $Tl;
867 case 'TJ':
868- $data = [$Tm, $currentText];
868+ $data = [$Tm, $extractedTexts[\count($extractedData)]];
869 if ($this->config->getDataTmFontInfoHasToBeIncluded()) {
These code changes should permanently resolve the issue by allowing full parsing to occur.
@Marco-Riccardi-HTA thank you for your feedback. Would you mind creating a pull request with these changes?
@k00ni: I may give a tray (I never contribute directly so I must learn how to interact with Github)
Description:
I'm getting an error when trying to run getDataTm() on a pdf. I have tried this on a native LAMP stack and inside a Symfony 6.3.6 framework. ErrorException - Undefined array key 38
PDF input
420399_20231024.pdf
Expected output & actual output
Code
I temporary solved wrapping the read of array "$extractedTexts" in a "array_key_exists()" checking and all pdf texts seems correctly retrivied except for last text in the bottom right:"Page 1 of n" that is lost.
In "vendor/smalot/pdfparser/src/Smalot/PdfParser/Page.php":
Before fix:
After fix: