smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Error when obtaining array from a PDF #681

Open andresflorez12 opened 4 months ago

andresflorez12 commented 4 months ago

Expected output & actual output

Expected output (just small fragments):

[42] => Time, [41] => ETA/ARR [11] => 20-JUL-2010 0900 [12] => 07-DEC-2012 0045 [13] => 30-JAN-2013 0700 [14] => 23-FEB-2013 1000 [15] => 21-MAR-2013 1730 [16] => 09-MAY-2013 0820 [17] => 27-JUL-2013 2250 [18] => 02-SEP-2013 0700 [19] => 13-NOV-2013 1942 [20] => 20-NOV-2013 2052 [21] => 15-JAN-2014 1155 [22] => 17-JAN-2014 0345 [23] => 18-JAN-2014 1200 [24] => 16-APR-2014 1200 [25] => 20-JAN-2015 1700 [26] => 20-JAN-2015 1700 [27] => 07-FEB-2015 1824 [28] => 12-MAR-2015 1237 [29] => 20-OCT-2015 1115 [30] => 19-FEB-2016 1900 [31] => 17-JAN-2017 0850 [32] => 21-MAR-2017 1710 [33] => 26-APR-2017 0700 [34] => 13-JUL-2017 0800 [35] => 26-OCT-2017 0700 [36] => 19-FEB-2018 0330 [37] => 10-JUN-2018 1740 [38] => 02-JUL-2018 0600 [39] => 10-JUL-2018 0710 [40] => 29-AUG-2018 0730

[323] => IMO # [305] => '' [306] => 1010179 [307] => 9195248 [307A] => '' [307B] => '' [308] => 6903395 [309] => 9147215 [310] => 9607435 [311] => 9192741 [312] => 5100532 [313] => 1010117 [314] => 9669471 [315] => 9551480 [316] => 8906169 [317] => 6725066 [318] => 743807 [319] => 9262003 [320] => 9773870 [321] => 6903981 [322] => 9235490

Actual output:

Array ( [0] => Arrivals [1] => Page [2] => 1 [3] => Enhanced Vessel Traffic Management System [4] => Prepared on: [5] => Report Id: [6] => SY5700RP [7] => Run by: [8] => REPORTS [9] => 05-FEB-2024 1130 [10] => Atlantic [11] => 20-JUL-2010 0900 [12] => 07-DEC-2012 0045 [13] => 30-JAN-2013 0700 [14] => 23-FEB-2013 1000 [15] => 21-MAR-2013 1730 [16] => 09-MAY-2013 0820 [17] => 27-JUL-2013 2250 [18] => 02-SEP-2013 0700 [19] => 13-NOV-2013 1942 [20] => 20-NOV-2013 2052 [21] => 15-JAN-2014 1155 [22] => 17-JAN-2014 0345 [23] => 18-JAN-2014 1200 [24] => 16-APR-2014 1200 [25] => 20-JAN-2015 1700 [26] => 20-JAN-2015 1700 [27] => 07-FEB-2015 1824 [28] => 12-MAR-2015 1237 [29] => 20-OCT-2015 1115 [30] => 19-FEB-2016 1900 [31] => 17-JAN-2017 0850 [32] => 21-MAR-2017 1710 [33] => 26-APR-2017 0700 [34] => 13-JUL-2017 0800 [35] => 26-OCT-2017 0700 [36]
=> 19-FEB-2018 0330 [37] => 10-JUN-2018 1740 [38] => 02-JUL-2018 0600 [39] => 10-JUL-2018 0710 [40] => 29-AUG-2018 0730 [41] => ETA/ARR [42] => Time [43] => [44] => [45] => [46] => [47] => [48] => [49] => [50] => [51] => [52] => [53] => [54] => [55] => [56] => [57] => [58] => [59] => [60] => [61] => [62] => [63] => [64] => [65] => [66] => [67] => [68] => [69] => [70] => [71] => [72] => Bk Date [73] => 3009758 [74] => 6007662 [75] => 3002206 [76] => 0354104 [77] => 3013206 [78] => 0368156 [79] => 0384917 [80] => 3013608 [81] => 3013713 [82] => 3008829 [83] => 3000660 [84] => 3013815 [85] => 3009984 [86] => 3014090 [87] => 3013301 [88] => 6006991 [89] => 3015105 [90] => 0770795 [91] => 0235661 [92] => 3016136 [93] => 3019329 [94] => 6002381 [95] => 3020297 [96] => 6014554 [97] => 3009692 [98] => 6015644 [99] => 3006914 [100] => 3015287 [101] => 0272787 [102] => 6016569 [103] => SIN [104] => [105] => DN 136 [106] => ELANDESS [107] => MAREIKE B [108] => MATT II [109] => BLACK SHEEP [110] => SONNY [111] => STADT DUESSELDORF [112] => SENTA [113] => MILITOS [114] => ATLANTIC LIGURIA [115] => CANDELA V [116] => IMAGINE [117] => ORINOQUIA I [118] => CONCEPCION [119] => CERRO ITAMUT [120] => PARITA I [121] => T/T BLACK JACK [122] => VB CALIFORNIA [123] => ADONAI [124] => CACIQUE [125] => DWS XPRESS [126] => SINA [127] => ABOUT TIME [128] => ARCANGEL SAN RAFAEL [129] => MAMMA MIA [130] => ARCANGEL SAN GABRIEL [131] => GREAT PORTOBELLO [132] => HC SVEA KIM [133] => ICB - 01 [134] => OÑI LEKUN [135] => Name [136] => 34.12 [137] => 196.85 [138] => 283.46 [139] => 120.01 [140] => 49.11 [141] => 244.09 [142] => 480.54 [143] => 40.85 [144] => 899.54 [145] => 600.39 [146] => 169.95 [147] => 214.90 [148] => 36.58 [149] => 70.00 [150] => 94.82 [151] => 89.90 [152] => 36.09 [153] => 105.31 [154] => 194.69 [155] => 151.71 [156] => 34.45 [157] => 328.05 [158] => 39.37 [159] => 95.28 [160] => 117.59 [161] => 95.47 [162] => 400.50 [163] => 424.70 [164] => 120.20 [165] => 104.99 [166] => 
Length [167] => 32.87 [168] => 36.19 [169] => 42.72 [170] => 45.57 [171] => 14.70 [172] => 33.53 [173] => 75.46 [174] => 11.93 [175] => 164.16 [176] => 89.99 [177] => 27.66 [178] => 40.88 [179] => 9.97 [180] => 48.26 [181] => 45.93 [182] => 40.03 [183] => 9.84 [184] => 30.35 [185] => 34.81 [186] => 28.35 [187] => 10.50 [188] => 62.32 [189] => 13.12 [190] => 41.34 [191] => 26.41 [192] => 43.41 [193] => 54.09 [194] => 52.79 [195] => 50.00 [196] => 39.37 [197] => Beam [198] => D [199] => HML [200] => CC [201] => Rest [202] => N [203] => H [204] => N [205] => N [206] => N [207] => 7 [208] => N [209] => N [210] => H [211] => 1 [212] => N [213] => N [214] => N [215] => N [216] => N [217] => N [218] => N [219] => N [220] => H [221] => N [222] => N [223] => N [224] => N [225] => N [226] => N [227] => N [228] => 7 [229] => H [230] => N [231] => N [232] => Pd [233] => S12GA [234] => Sched [235] => No. [236] => Y [237] => N [238] => N [239] => N [240] => N [241] => N [242] => N [243] => N [244] => N [245] => N [246] => N [247] => N [248] => Y [249] => N [250] => Y [251] => Y [252] => N [253] => N [254] => N [255] => N [256] => N [257] => N [258] => N [259] => N [260] => N [261] => N [262] => N [263] => N [264] => N [265] => N [266] => Tr [267] => Flg [268] => HRM [269] => CPC [270] => HRM+ [271] => HRM [272] => HRM [273] => Hold P [274] => 22-JUL-2010 0828* [275] => First Lock Time [276] => Depart Last Lock [277] => ASA [278] => ASA [279] => PA [280] => GATE [281] => AGENSA [282] => SEASAG [283] => CENTCO [284] => FERNIE [285] => INCH [286] => TINAMC [287] => STANLE [288] => PCC [289] => PCC [290] => ROZO [291] => ASA [292] => INTCAR [293] => STWARD [294] => MASTER [295] => COSCO [296] => CENTCO [297] => ATLASM [298] => MASTER [299] => ATLASM [300] => ONIX [301] => ATLASM [302] => ASA [303] => NL [304] => Agent [305] => Customer [306] => 1010179 [307] => 9195248 [308] => 6903395 [309] => 9147215 [310] => 9607435 [311] => 9192741 [312] => 5100532 [313] => 1010117 [314] => 
9669471 [315] => 9551480 [316] => 8906169 [317] => 6725066 [318] => 743807 [319] => 9262003 [320] => 9773870 [321] => 6903981 [322] => 9235490 [323] => IMO # [324] => Vsl [325] => Cd [326] => 14 [327] => 21 [328] => 01 [329] => 14 [330] => 21 [331] => 28 [332] => 07 [333] => 21 [334] => 04 [335] => 29 [336] => 01 [337] => 21 [338] => 21 [339] => 14 [340] => 18 [341] => 18 [342] => 21 [343] => 18 [344] => 01 [345] => 21 [346] => 21 [347] => 07 [348] => 21 [349] => 18 [350] => 21 [351] => 18 [352] => 28 [353] => 01 [354] => 14 [355] => 50 [356] => 1,090 [357] => 2,545 [358] => 251 [359] => 1,141 [360] => 9,528 [361] => 18 [362] => 23,843 [363] => 423 [364] => 1,503 [365] => 484 [366] => 359 [367] => 331 [368] => 810 [369] => 458 [370] => 4,462 [371] => 299 [372] => 4,605 [373] => 6,382 [374] => Gross [375] => Ton [376] => 2009 [377] => 2001 [378] => 1984 [379] => 1968 [380] => 1998 [381] => 1997 [382] => 1999 [383] => 1956 [384] => 2011 [385] => 2007 [386] => 2013 [387] => 2011 [388] => 1989 [389] => 1967 [390] => 2002 [391] => 2003 [392] => 2007 [393] => 1969 [394] => 2000 [395] => Yr [396] => Blt [397] => 11/06 [398] => 16/00 [399] => 02/00 [400] => 22/08 [401] => 09/00 [402] => 10/08 [403] => 09/00 [404] => 19/06 [405] => 05/00 [406] => 13/00 [407] => 13/02 [408] => 05/00 [409] => 10/00 [410] => Max [411] => TFW [412] => Visit No. [413] => 178234 [414] => 229018 [415] => 231510 [416] => 232797 [417] => 234001 [418] => 236363 [419] => 239894 [420] => 241625 [421] => 244750 [422] => 245363 [423] => 247804 [424] => 247917 [425] => 247661 [426] => 249568 [427] => 264857 [428] => 264860 [429] => 265640 [430] => 267371 [431] => 278035 [432] => 283425 [433] => 298998 [434] => 302484 [435] => 304483 [436] => 308604 [437] => 314107 [438] => 320699 [439] => 327002 [440] => 327464 [441] => 328501 [442] => 331162 [443] => PMX+ )

Code

// Include Composer autoloader if not already done.
include 'pdfparser/vendor/autoload.php';

// Parse pdf file and build necessary objects.
$config = new \Smalot\PdfParser\Config();
$config->setIgnoreEncryption(true);
$config->setPdfWhitespaces='\f\r';
/**

vinceDeNoisy commented 2 months ago

PHP version: 8.2.0
PDFParser version: v2.9.0

Exactly the same issue: it was working until I had to update PDFParser for PHP 8.

ep4_1_2024-02.pdf

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($fileName);
$page = $pdf->getPages()[2];

echo "\ngetText\n";
print_r($page->getText());
// expected output: a different separator for lines and words (typically "\n" and "\t" or " ")
// actual output: "\n" and " " between each field => impossible to parse lines

echo "\ngetTextArray\n";
print_r($page->getTextArray());
// expected output: array with the same structure as the pdf
// actual output: an array with one different value for each word

echo "\ngetDataTm\n";
$data = $page->getDataTm();
foreach ($data as $k => $td) {
    $text = $td[1];
    if (!trim($text)) continue;
    echo 'text' . $text . "\n";
    echo 'transformation matrix = (' . $td[0][0] . ',' . $td[0][1] . ',' . $td[0][2] . ',' . $td[0][3] . ")\n";
    echo 'position x=' . $td[0][4] . ' y=' . $td[0][5] . "\n";
}
// expected output: position different for each text element
// actual output: same position for every text element