Open andresflorez12 opened 9 months ago
PHP Version: 8.2.0
PDFParser Version: v2.9.0
Exactly the same issue: it was working until I had to update PDFParser for PHP 8.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($fileName);
$page = $pdf->getPages()[2];

echo "\ngetText\n";
print_r($page->getText());
// expected output: a different separator for lines and words (typically "\n" and "\t" or " ")
// actual output: "\n" and " " between each field => impossible to parse lines

echo "\ngetTextArray\n";
print_r($page->getTextArray());
// expected output: array with the same structure as the PDF
// actual output: an array with one different value for each word

echo "\ngetDataTm\n";
$data = $page->getDataTm();
foreach ($data as $k => $td) {
    $text = $td[1];
    if (!trim($text)) continue;
    echo 'text = '.$text."\n";
    echo 'transformation matrix = ('.$td[0][0].','.$td[0][1].','.$td[0][2].','.$td[0][3].')'."\n";
    echo 'position x='.$td[0][4].' y='.$td[0][5]."\n";
}
// expected output: position different for each text element
// actual output: same position for every text element
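For reference, this is roughly how the positions would be used if they were distinct. It is only a sketch: `groupIntoRows()` and `$tolerance` are hypothetical names (not part of PdfParser), the sample coordinates and values are made up, and it assumes `getDataTm()` returns a different x/y for each text element, which is exactly what fails here.

```php
<?php
// Sketch: rebuild table rows from getDataTm()-style entries.
// Each entry is [[a, b, c, d, x, y], text], the shape indexed in the loop above.
// groupIntoRows() and $tolerance are hypothetical, not PdfParser API.

function groupIntoRows(array $dataTm, float $tolerance = 2.0): array
{
    // Keep non-empty fragments together with their coordinates.
    $items = [];
    foreach ($dataTm as $td) {
        $text = trim((string) $td[1]);
        if ($text === '') {
            continue;
        }
        $items[] = ['x' => (float) $td[0][4], 'y' => (float) $td[0][5], 'text' => $text];
    }

    // PDF user space starts at the bottom-left, so a larger y is higher on the page.
    usort($items, fn ($a, $b) => $b['y'] <=> $a['y']);

    // Start a new row whenever y moves more than $tolerance away from the row's first item.
    $rows = [];
    $rowY = null;
    foreach ($items as $item) {
        if ($rowY === null || abs($rowY - $item['y']) > $tolerance) {
            $rows[] = [];
            $rowY = $item['y'];
        }
        $rows[count($rows) - 1][] = $item;
    }

    // Within each row, order left to right and keep only the text.
    return array_map(function (array $row): array {
        usort($row, fn ($a, $b) => $a['x'] <=> $b['x']);
        return array_column($row, 'text');
    }, $rows);
}

// Made-up sample: two rows of two cells, deliberately out of order.
$sample = [
    [[1, 0, 0, 1, 120.0, 700.0], 'ETA/ARR'],
    [[1, 0, 0, 1, 40.0, 700.5], 'Name'],
    [[1, 0, 0, 1, 40.0, 680.0], 'VESSEL A'],
    [[1, 0, 0, 1, 120.0, 680.0], '01-JAN-2020 0900'],
];
print_r(groupIntoRows($sample));
```

With real data, `$sample` would be replaced by `$page->getDataTm()`. The tolerance absorbs small baseline differences between cells of the same row and would need tuning per document.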
Description:
As you can see, the PDF contains a table with rows and columns, but when it is parsed the order of the values changes. In many cases an empty cell produces no value at all, so I cannot build arrays that keep the values in order: the values come out in an order that does not follow the rows and columns of the table.
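To make the empty-cell problem concrete, here is a sketch of what I would like to be able to do: assign each text fragment to a column by its x position, so that a missing cell stays in the array as an empty string. `alignToColumns()` and the `$columnStarts` values are hypothetical (they would have to be measured from the header row of the real PDF), and the sample row is made up.

```php
<?php
// Sketch: keep empty table cells by assigning each fragment to a column by x position.
// alignToColumns() and $columnStarts are hypothetical, not PdfParser API.

function alignToColumns(array $rowItems, array $columnStarts): array
{
    sort($columnStarts);
    // One empty string per column, so missing cells remain visible in the result.
    $cells = array_fill(0, count($columnStarts), '');

    foreach ($rowItems as $item) {
        // Pick the last column whose left edge is at or before the fragment.
        $col = 0;
        foreach ($columnStarts as $i => $start) {
            if ($item['x'] >= $start) {
                $col = $i;
            }
        }
        // Join fragments that land in the same cell.
        $cells[$col] = trim($cells[$col] . ' ' . $item['text']);
    }

    return $cells;
}

// Made-up row: the middle column has no text in the PDF.
$row = [
    ['x' => 40.0, 'text' => 'VESSEL A'],
    ['x' => 300.0, 'text' => '1234567'],
];
$cells = alignToColumns($row, [40.0, 120.0, 300.0]);
// $cells is ['VESSEL A', '', '1234567']
```

This only works if every fragment comes with a usable x coordinate; with a flat list of values there is no way to tell which cell was skipped.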
PDF input
ARRIVALS.pdf
Expected output & actual output
Just small fragments
[42] => Time,[41] => ETA/ARR [11] => 20-JUL-2010 0900 [12] => 07-DEC-2012 0045 [13] => 30-JAN-2013 0700 [14] => 23-FEB-2013 1000 [15] => 21-MAR-2013 1730 [16] => 09-MAY-2013 0820 [17] => 27-JUL-2013 2250 [18] => 02-SEP-2013 0700 [19] => 13-NOV-2013 1942 [20] => 20-NOV-2013 2052 [21] => 15-JAN-2014 1155 [22] => 17-JAN-2014 0345 [23] => 18-JAN-2014 1200 [24] => 16-APR-2014 1200 [25] => 20-JAN-2015 1700 [26] => 20-JAN-2015 1700 [27] => 07-FEB-2015 1824 [28] => 12-MAR-2015 1237 [29] => 20-OCT-2015 1115 [30] => 19-FEB-2016 1900 [31] => 17-JAN-2017 0850 [32] => 21-MAR-2017 1710 [33] => 26-APR-2017 0700 [34] => 13-JUL-2017 0800 [35] => 26-OCT-2017 0700 [36] => 19-FEB-2018 0330 [37] => 10-JUN-2018 1740 [38] => 02-JUL-2018 0600 [39] => 10-JUL-2018 0710 [40] => 29-AUG-2018 0730
[323] => IMO # [305]=>'' [306] => 1010179 [307] => 9195248 [307A]=>'' [307B]=>'' [308] => 6903395 [309] => 9147215 [310] => 9607435 [311] => 9192741 [312] => 5100532 [313] => 1010117 [314] => 9669471 [315] => 9551480 [316] => 8906169 [317] => 6725066 [318] => 743807 [319] => 9262003 [320] => 9773870 [321] => 6903981 [322] => 9235490
ACTUAL OUTPUT
Array ( [0] => Arrivals [1] => Page [2] => 1 [3] => Enhanced Vessel Traffic Management System [4] => Prepared on: [5] => Report Id: [6] => SY5700RP [7] => Run by: [8] => REPORTS [9] => 05-FEB-2024 1130 [10] => Atlantic [11] => 20-JUL-2010 0900 [12] => 07-DEC-2012 0045 [13] => 30-JAN-2013 0700 [14] => 23-FEB-2013 1000 [15] => 21-MAR-2013 1730 [16] => 09-MAY-2013 0820 [17] => 27-JUL-2013 2250 [18] => 02-SEP-2013 0700 [19] => 13-NOV-2013 1942 [20] => 20-NOV-2013 2052 [21] => 15-JAN-2014 1155 [22] => 17-JAN-2014 0345 [23] => 18-JAN-2014 1200 [24] => 16-APR-2014 1200 [25] => 20-JAN-2015 1700 [26] => 20-JAN-2015 1700 [27] => 07-FEB-2015 1824 [28] => 12-MAR-2015 1237 [29] => 20-OCT-2015 1115 [30] => 19-FEB-2016 1900 [31] => 17-JAN-2017 0850 [32] => 21-MAR-2017 1710 [33] => 26-APR-2017 0700 [34] => 13-JUL-2017 0800 [35] => 26-OCT-2017 0700 [36] 
=> 19-FEB-2018 0330 [37] => 10-JUN-2018 1740 [38] => 02-JUL-2018 0600 [39] => 10-JUL-2018 0710 [40] => 29-AUG-2018 0730 [41] => ETA/ARR [42] => Time [43] => [44] => [45] => [46] => [47] => [48] => [49] => [50] => [51] => [52] => [53] => [54] => [55] => [56] => [57] => [58] => [59] => [60] => [61] => [62] => [63] => [64] => [65] => [66] => [67] => [68] => [69] => [70] => [71] => [72] => Bk Date [73] => 3009758 [74] => 6007662 [75] => 3002206 [76] => 0354104 [77] => 3013206 [78] => 0368156 [79] => 0384917 [80] => 3013608 [81] => 3013713 [82] => 3008829 [83] => 3000660 [84] => 3013815 [85] => 3009984 [86] => 3014090 [87] => 3013301 [88] => 6006991 [89] => 3015105 [90] => 0770795 [91] => 0235661 [92] => 3016136 [93] => 3019329 [94] => 6002381 [95] => 3020297 [96] => 6014554 [97] => 3009692 [98] => 6015644 [99] => 3006914 [100] => 3015287 [101] => 0272787 [102] => 6016569 [103] => SIN [104] => [105] => DN 136 [106] => ELANDESS [107] => MAREIKE B [108] => MATT II [109] => BLACK SHEEP [110] => SONNY [111] => STADT DUESSELDORF [112] => SENTA [113] => MILITOS [114] => ATLANTIC LIGURIA [115] => CANDELA V [116] => IMAGINE [117] => ORINOQUIA I [118] => CONCEPCION [119] => CERRO ITAMUT [120] => PARITA I [121] => T/T BLACK JACK [122] => VB CALIFORNIA [123] => ADONAI [124] => CACIQUE [125] => DWS XPRESS [126] => SINA [127] => ABOUT TIME [128] => ARCANGEL SAN RAFAEL [129] => MAMMA MIA [130] => ARCANGEL SAN GABRIEL [131] => GREAT PORTOBELLO [132] => HC SVEA KIM [133] => ICB - 01 [134] => OÑI LEKUN [135] => Name [136] => 34.12 [137] => 196.85 [138] => 283.46 [139] => 120.01 [140] => 49.11 [141] => 244.09 [142] => 480.54 [143] => 40.85 [144] => 899.54 [145] => 600.39 [146] => 169.95 [147] => 214.90 [148] => 36.58 [149] => 70.00 [150] => 94.82 [151] => 89.90 [152] => 36.09 [153] => 105.31 [154] => 194.69 [155] => 151.71 [156] => 34.45 [157] => 328.05 [158] => 39.37 [159] => 95.28 [160] => 117.59 [161] => 95.47 [162] => 400.50 [163] => 424.70 [164] => 120.20 [165] => 104.99 [166] => 
Length [167] => 32.87 [168] => 36.19 [169] => 42.72 [170] => 45.57 [171] => 14.70 [172] => 33.53 [173] => 75.46 [174] => 11.93 [175] => 164.16 [176] => 89.99 [177] => 27.66 [178] => 40.88 [179] => 9.97 [180] => 48.26 [181] => 45.93 [182] => 40.03 [183] => 9.84 [184] => 30.35 [185] => 34.81 [186] => 28.35 [187] => 10.50 [188] => 62.32 [189] => 13.12 [190] => 41.34 [191] => 26.41 [192] => 43.41 [193] => 54.09 [194] => 52.79 [195] => 50.00 [196] => 39.37 [197] => Beam [198] => D [199] => HML [200] => CC [201] => Rest [202] => N [203] => H [204] => N [205] => N [206] => N [207] => 7 [208] => N [209] => N [210] => H [211] => 1 [212] => N [213] => N [214] => N [215] => N [216] => N [217] => N [218] => N [219] => N [220] => H [221] => N [222] => N [223] => N [224] => N [225] => N [226] => N [227] => N [228] => 7 [229] => H [230] => N [231] => N [232] => Pd [233] => S12GA [234] => Sched [235] => No. [236] => Y [237] => N [238] => N [239] => N [240] => N [241] => N [242] => N [243] => N [244] => N [245] => N [246] => N [247] => N [248] => Y [249] => N [250] => Y [251] => Y [252] => N [253] => N [254] => N [255] => N [256] => N [257] => N [258] => N [259] => N [260] => N [261] => N [262] => N [263] => N [264] => N [265] => N [266] => Tr [267] => Flg [268] => HRM [269] => CPC [270] => HRM+ [271] => HRM [272] => HRM [273] => Hold P [274] => 22-JUL-2010 0828* [275] => First Lock Time [276] => Depart Last Lock [277] => ASA [278] => ASA [279] => PA [280] => GATE [281] => AGENSA [282] => SEASAG [283] => CENTCO [284] => FERNIE [285] => INCH [286] => TINAMC [287] => STANLE [288] => PCC [289] => PCC [290] => ROZO [291] => ASA [292] => INTCAR [293] => STWARD [294] => MASTER [295] => COSCO [296] => CENTCO [297] => ATLASM [298] => MASTER [299] => ATLASM [300] => ONIX [301] => ATLASM [302] => ASA [303] => NL [304] => Agent [305] => Customer [306] => 1010179 [307] => 9195248 [308] => 6903395 [309] => 9147215 [310] => 9607435 [311] => 9192741 [312] => 5100532 [313] => 1010117 [314] => 
9669471 [315] => 9551480 [316] => 8906169 [317] => 6725066 [318] => 743807 [319] => 9262003 [320] => 9773870 [321] => 6903981 [322] => 9235490 [323] => IMO # [324] => Vsl [325] => Cd [326] => 14 [327] => 21 [328] => 01 [329] => 14 [330] => 21 [331] => 28 [332] => 07 [333] => 21 [334] => 04 [335] => 29 [336] => 01 [337] => 21 [338] => 21 [339] => 14 [340] => 18 [341] => 18 [342] => 21 [343] => 18 [344] => 01 [345] => 21 [346] => 21 [347] => 07 [348] => 21 [349] => 18 [350] => 21 [351] => 18 [352] => 28 [353] => 01 [354] => 14 [355] => 50 [356] => 1,090 [357] => 2,545 [358] => 251 [359] => 1,141 [360] => 9,528 [361] => 18 [362] => 23,843 [363] => 423 [364] => 1,503 [365] => 484 [366] => 359 [367] => 331 [368] => 810 [369] => 458 [370] => 4,462 [371] => 299 [372] => 4,605 [373] => 6,382 [374] => Gross [375] => Ton [376] => 2009 [377] => 2001 [378] => 1984 [379] => 1968 [380] => 1998 [381] => 1997 [382] => 1999 [383] => 1956 [384] => 2011 [385] => 2007 [386] => 2013 [387] => 2011 [388] => 1989 [389] => 1967 [390] => 2002 [391] => 2003 [392] => 2007 [393] => 1969 [394] => 2000 [395] => Yr [396] => Blt [397] => 11/06 [398] => 16/00 [399] => 02/00 [400] => 22/08 [401] => 09/00 [402] => 10/08 [403] => 09/00 [404] => 19/06 [405] => 05/00 [406] => 13/00 [407] => 13/02 [408] => 05/00 [409] => 10/00 [410] => Max [411] => TFW [412] => Visit No. [413] => 178234 [414] => 229018 [415] => 231510 [416] => 232797 [417] => 234001 [418] => 236363 [419] => 239894 [420] => 241625 [421] => 244750 [422] => 245363 [423] => 247804 [424] => 247917 [425] => 247661 [426] => 249568 [427] => 264857 [428] => 264860 [429] => 265640 [430] => 267371 [431] => 278035 [432] => 283425 [433] => 298998 [434] => 302484 [435] => 304483 [436] => 308604 [437] => 314107 [438] => 320699 [439] => 327002 [440] => 327464 [441] => 328501 [442] => 331162 [443] => PMX+ )
Code
// Include Composer autoloader if not already done.
include 'pdfparser/vendor/autoload.php';
// Parse pdf file and build necessary objects.
$config = new \Smalot\PdfParser\Config();
$config->setIgnoreEncryption(true);
$config->setPdfWhitespaces("\f\r");
/**
'; }