tabulapdf / tabula-java

Extract tables from PDF files
MIT License

Appending a zero to some integers #80

Closed jwhulette closed 8 years ago

jwhulette commented 8 years ago

When I extract the data from the PDF, some of the data has a zero appended to it.

Output:
"" "01/02/2014 16:49:50 " "Lee County Toll System " "Page " "1 " "of " 1
"" DAILY LANE TRAFFIC REPORT
"" Cape Coral
"" Revenue Day: 01/01/2014
"Hour " "Lane 1 " "Lane " "2 Lane 3 Lane " "4 " "Lane 5 " Total
"0 " "1720 " "1260 110 " "0 " "1110 " 519
"1 " "1880 " "1540 1540 " "0 " "1170 " 613
"2 " "1010 " "690 890 " "0 " "620 " 321
"3 " "340 " "410 460 " "0 " "330 " 154
"4 " "20 " "210 270 " "0 " "240 " 92
"5 " "160 " "340 70 " "0 " "250 " 82
"6 " "270 " "240 30 " "0 " "40 " 121
"7 " "840 " "590 330 " "0 " "840 " 260
"8 " "680 " "1050 0 " "0 " "610 " 234
"9 " "690 " "920 480 " "0 " "80 " 289
"10 " "1110 " "110 1230 " "0 " "1120 " 456
"11 " "1580 " "120 1410 " "0 " "1730 " 592
"12 " "1560 " "1830 2070 " "0 " "2380 " 784
"13 " "190 " "1990 220 " "0 " "2520 " 861
"" 0
"14 " "2580 " "2270 250 " "0 " "3270 " 1062
"15 " "3160 " "2360 270 " "0 " "3830 " 1205
"16 " "2910 " "2230 2040 " "1910 " "3550 " 1264
"17 " "320 " "2190 2070 " "1830 " "3310 " 1260
"" 0
"18 " "2890 " "2130 2050 " "1580 " "2870 " 1152
"19 " "2340 " "2010 2330 " "0 " "2990 " 967
"20 " "1760 " "170 1790 " "0 " "2210 " 746
"21 " "1390 " "1220 1430 " "0 " "1950 " 599
"22 " "1210 " "90 980 " "0 " "1360 " 445
"23 " "830 " "590 730 " "0 " "780 " 293
"Subtotal " "3621 " "3097 " "3097 " "532 " "4024 " 14371

Source file: 0201012014DLTRF.PDF

leeper commented 8 years ago

It looks like that PDF just has some weird embedded, non-visible 0 characters. If you copy-paste the text of the table or run a pdftotext operation on the page, it reveals a big matrix of zeroes that seem to underlie the table.
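For anyone who wants to repeat that check, here is a rough sketch in Python. It assumes Poppler's `pdftotext` command-line tool is installed and on the PATH. The helper names `dump_text` and `zero_token_share` are made up for illustration and are not part of tabula-java. The idea is simply to dump the raw text layer and see how much of it consists of stand-alone `0` tokens.

```python
import subprocess
from collections import Counter


def dump_text(pdf_path: str) -> str:
    """Return the raw text layer of a PDF via Poppler's pdftotext CLI.

    Assumption: pdftotext is installed and available on the PATH.
    """
    # "-layout" keeps the physical layout; "-" writes the text to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def zero_token_share(text: str) -> float:
    """Fraction of whitespace-separated tokens that are exactly "0".

    Hypothetical heuristic: an unusually high share hints at a hidden
    grid of zeroes underneath the visible table.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    return Counter(tokens)["0"] / len(tokens)
```

Calling `zero_token_share(dump_text("0201012014DLTRF.PDF"))` would give a quick number to compare against a PDF known to be clean. What counts as "too many" zeroes is a judgment call, so treat it as a hint rather than a test.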

jwhulette commented 8 years ago

Thank you for the feedback. So it's just a crappily produced PDF. Love local government.