tabulapdf / tabula-java

Extract tables from PDF files
MIT License

Column mix-up: Column boundaries seem to be generated for each page separately although result is actually one CSV #134

Open stink0r opened 7 years ago

stink0r commented 7 years ago

When scanning a large PDF file with column auto-detection while defining the area via "-a", the command-line tool does not consider all columns, especially when some of them are empty on some pages. This mixes up the columns (the commas for empty columns are not written to the resulting CSV), which destroys the structure of the generated CSV. It appears that the tool uses different column boundaries for each table on each page of the PDF, but that does not make sense when there is supposed to be ONE resulting CSV. The same column boundaries should therefore be used for every page.

It should work as follows: Scan each page at the specified area, and auto-detect the max-number of columns. Then internally build the columns with "-c" parameters, and specify all columns that have EVER been seen on any of the pages. So if one page just fills 10 columns, but the next one fills 15, and the following one fills 12, we still need to always read 15 columns - otherwise the structure is broken. (A filled column is a column containing data, so non-empty column.)
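To make the idea concrete, here is a minimal Python sketch of the merging step. It is not Tabula's code: it assumes some earlier detection step has already produced a list of column boundaries (x coordinates) for each page, and the function names and the tolerance value are hypothetical. It only shows how the per-page boundaries could be unioned into one list suitable for "-c".

```python
from typing import List


def merge_column_boundaries(pages: List[List[float]], tolerance: float = 3.0) -> List[float]:
    """Union the per-page column boundaries, collapsing near-duplicates.

    `pages` holds one list of x coordinates per page (assumed input).
    Boundaries closer than `tolerance` to the previous one are treated
    as the same column edge and averaged.
    """
    all_xs = sorted(x for page in pages for x in page)
    clusters: List[List[float]] = []
    for x in all_xs:
        if clusters and x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)  # same boundary seen on another page
        else:
            clusters.append([x])    # a boundary not seen so far
    return [round(sum(c) / len(c), 1) for c in clusters]


def to_columns_argument(boundaries: List[float]) -> str:
    """Format the boundaries as a comma-separated value for "-c"."""
    return ",".join(f"{b:g}" for b in boundaries)


if __name__ == "__main__":
    # Three pages that each show a different subset of the columns.
    detected = [[43, 227, 300], [43.5, 227, 255, 300], [44, 300.5, 470]]
    print(to_columns_argument(merge_column_boundaries(detected)))
```

The merged list contains every boundary that was ever seen on any page, so a page with only 10 filled columns would still be read with the full set.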

Try this: Download the PDF from here: http://www.kba.de/DE/Fahrzeugtechnik/Fahrzeugtypdaten_amtlDaten_TGV/Auskuenfte_Informationen/Veroeffentlichungen/SV2.html?nn=669132

Then parse it with the following command, which uses my self-created area and columns (this works even better): java -jar tabula-0.9.1-jar-with-dependencies.jar -n -p 18-1261 -a 810,0,145,1200 -c 43,227,255,300,470,500,520,600,660,720,770,830,890,950,1000,1060,1110,1180 "sv221_m1_schad_pdf.pdf" > "schad_manual.csv"

Then parse it with this command (created by tabula web), which relies on column auto-detection: java -jar tabula-0.9.1-jar-with-dependencies.jar -n -p 18-1261 -a 163.646,38.237,778.909,1155.791 "sv221_m1_schad_pdf.pdf" > "schad_tabula-web.csv"

... and compare the files. Filter for "ADQ" in the third column and look at the last three lines (53197, 53198, 53199). You will see serious differences in where the data is placed. In the correct output, the data is in column K; in the wrong one, it is in column J, so the columns are mixed up.

I'd be happy to provide you with more information in case the issue is unclear. Thank you.

jeremybmerrill commented 7 years ago

Hi @sYbb12, thanks for your report. It's totally clear. I think a worthwhile enhancement to Tabula would be a flag saying that the entire PDF should be considered one table for the detection of columns.

(Or perhaps you have another idea of how to implement this from an API/user interface perspective? Keeping in mind that this cannot be the default behavior, because many PDFs include multiple tables with distinct semantics.)

In any case, the behavior you describe is, in a narrow sense, expected behavior, since Tabula is designed to treat each page as a separate table, unrelated to the others. But coming up with a way to change that might be worthwhile.

stink0r commented 7 years ago

Hi Jeremy,

thanks for your response. I forgot that it's "normal" for other PDFs to contain multiple tables with different column boundaries. I agree with your suggestion of adding a parameter, e.g. "considerAllTablesAsOne=true".

An alternative would be to auto-detect the number of tables by checking whether the structure changes from one PDF page to the next. If there is e.g. text or images in between, then it's probable that there are multiple tables. If the table continues across pages (as in my example) without interruptions, then it could be considered one table up to the last page before text, images or whatever occurs again.
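A rough Python sketch of that grouping idea follows. It is only an illustration, not anything Tabula provides: the `PageInfo` type and its `interrupted_before` flag are hypothetical and stand in for whatever check would detect text or images ahead of the table on a page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageInfo:
    number: int
    # Hypothetical flag: True if text, images, etc. appear before the
    # table on this page, i.e. the table does not simply continue.
    interrupted_before: bool


def group_pages_into_tables(pages: List[PageInfo]) -> List[List[int]]:
    """Group consecutive pages that appear to belong to the same table."""
    groups: List[List[int]] = []
    current: List[int] = []
    for page in pages:
        if page.interrupted_before and current:
            groups.append(current)  # an interruption closes the running table
            current = []
        current.append(page.number)
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    pages = [
        PageInfo(18, False),
        PageInfo(19, False),
        PageInfo(20, True),   # something other than the table shows up here
        PageInfo(21, False),
    ]
    print(group_pages_into_tables(pages))
```

Each resulting group of pages would then share one set of column boundaries, while different groups could still get their own.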

I'd go for the auto-solution if you can find a reliable way of detecting the start and the end of a table. With (custom) command-line parameters, you always have the problem that users don't easily find them.

Best, sYbb