tabulapdf / tabula-extractor

Extract tables from PDF files
MIT License

Issues from cell spanning multiple rows #86

Open jtbates opened 10 years ago

jtbates commented 10 years ago

I have PDFs from Indonesian election results that I am attempting to parse to CSVs. These contain spreadsheets where a cell may span multiple rows:

[Image: screen shot 2014-06-23 at 5 56 09 pm]

I used the following command with tabula-extractor:

$ tabula DD-1_-_DPR_-_9201_-_PAPUA_BARAT.pdf -p all -r -o DD-1_-_DPR_-_9201_-_PAPUA_BARAT.csv

The row-spanning cells seem to be causing a couple of problems. Output for reference:

[Image: screen shot 2014-06-23 at 5 55 41 pm]

The first problem is that, for cells that span multiple rows, the text after the first line is discarded. This can be seen in the selected cell in the picture: Tetap (DPT) is missing. Similarly, Tambahan (DPTb) is missing for the next cell, and so forth.

The second problem is that the row below is sometimes split. This seems to happen once or twice but not thereafter. In this example, rows 7 and 8 should be joined. This can be seen more clearly in the CSV output (lines 7-9):

"",""
PR,"82,484","79,536","38,602","17,395","16,796","9,965","14,300","16,500","24,532","21,368","10,751","332,229"
"","",JML,"174,769","165,250","86,097","38,185","34,895","21,874","28,869","35,937","50,255","49,179","23,791","709,101"

Here is the PDF I used in this example and here is the output from tabula-extractor.

jeremybmerrill commented 10 years ago

Hi Jordan,

Sorry again for the delay in getting back to you. You found some nice bugs here! Thanks!

I've figured out the source of the first problem; for better or for worse, the bug is more philosophical than technical.

Tabula's "spreadsheet" extraction method uses vector lines to attempt to recreate the structure of the table, since PDFs only have lines, with no conception of tables or relationships between lines. Sometimes, there are lines that are not visible to a viewer, but are present in the PDF. That's the case here: there's a white line running across cell B6 ( 1 Jumlah pemilih terdaftar dalam Daftar Pemilih PR Tetap (DPT) ) at the same height as the line separating PR and JML in column 3. (And since the line crosses the Tetap (DPT) text, the text isn't included in either cell and therefore ends up ignored.)
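To illustrate why text crossed by a ruling can end up in neither cell, here is a minimal Python sketch. It assumes a text chunk is assigned to a cell only when the cell's box fully contains it; that rule, the `Box` class, and all coordinates are hypothetical and are not taken from Tabula's code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle; y grows downward, as in page coordinates."""
    top: float
    left: float
    bottom: float
    right: float

    def contains(self, other: "Box") -> bool:
        return (self.top <= other.top and self.left <= other.left
                and self.bottom >= other.bottom and self.right >= other.right)


def split_at(cell: Box, y: float) -> List[Box]:
    """Split a cell in two at a horizontal ruling located at height y."""
    return [Box(cell.top, cell.left, y, cell.right),
            Box(y, cell.left, cell.bottom, cell.right)]


def cells_containing(text: Box, cells: List[Box]) -> List[Box]:
    """Return every cell whose box fully contains the text box."""
    return [c for c in cells if c.contains(text)]


# Hypothetical coordinates: one tall cell holding two lines of text.
cell = Box(top=100.0, left=50.0, bottom=130.0, right=300.0)
first_line = Box(top=102.0, left=55.0, bottom=112.0, right=290.0)
second_line = Box(top=110.0, left=55.0, bottom=120.0, right=120.0)

# An invisible ruling at y = 114.02 splits the cell and crosses the second line.
halves = split_at(cell, 114.02)
```

With the unsplit cell, both lines are found. After the split, `cells_containing(first_line, halves)` returns the upper half, while `cells_containing(second_line, halves)` returns an empty list, which matches the dropped Tetap (DPT) text.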

The way to fix this is to tell Tabula to ignore non-black lines. This is all built out, but isn't present in the script in bin/ -- since it's weird, hard to describe and hard to tune. I could probably send you a substitute file that'd include that option, or maybe add it as an undocumented feature. (@jazzido, what do you think about the options for surfacing the line_color_filter thing?) I think it's too complicated for a command-line option -- or at least, I don't know how to represent a range of RGB colors on the command line and describe that method in an intuitive way.
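As a rough idea of what "ignore non-black lines" could mean, here is a Python sketch that filters rulings by stroke color before the table structure is rebuilt. The `Ruling` type, `is_near_black`, `keep_dark_rulings`, and the single `tolerance` value are all hypothetical; this is not the signature of the line_color_filter option mentioned above.

```python
from typing import List, NamedTuple, Tuple

RGB = Tuple[float, float, float]  # each component in the range 0.0 to 1.0


class Ruling(NamedTuple):
    """A straight line segment with its stroke color (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float
    color: RGB


def is_near_black(color: RGB, tolerance: float = 0.1) -> bool:
    """True when every color component is within `tolerance` of 0 (black)."""
    return all(component <= tolerance for component in color)


def keep_dark_rulings(rulings: List[Ruling], tolerance: float = 0.1) -> List[Ruling]:
    """Drop rulings whose stroke color is not close to black."""
    return [r for r in rulings if is_near_black(r.color, tolerance)]


# Hypothetical rulings: a visible black border, an invisible white line,
# and a dark gray line that falls inside the tolerance.
rulings = [
    Ruling(50.0, 100.0, 300.0, 100.0, (0.0, 0.0, 0.0)),
    Ruling(50.0, 114.02, 300.0, 114.02, (1.0, 1.0, 1.0)),
    Ruling(50.0, 130.0, 300.0, 130.0, (0.05, 0.05, 0.05)),
]
visible = keep_dark_rulings(rulings)
```

Here `visible` keeps the black and dark gray rulings and drops the white one. A single tolerance is only one way to express "a range of RGB colors"; it is easy to type but cannot describe, say, a table drawn entirely in dark blue.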

The second problem was sort of related, but is an actual technological bug. The "split" line is actually at two different y-axis locations: 114.0199966430664 for the first two (empty) cells and 114.02000427246094 for the rest. We need to round... because floating point numbers are dumb. That patch is 9b650f4.
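The effect of rounding can be sketched in Python using the two y values quoted above. The `group_by_row` helper and the choice of two decimal places are assumptions for illustration; they are not a description of what the patch itself does.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_row(tops: List[float], ndigits: int = 2) -> Dict[float, List[float]]:
    """Group cell top coordinates into rows, keyed by the rounded coordinate."""
    rows: Dict[float, List[float]] = defaultdict(list)
    for top in tops:
        rows[round(top, ndigits)].append(top)
    return dict(rows)


# The two y-axis locations from the comment above: two empty cells at the
# first value, the remaining cells at the second.
tops = [114.0199966430664, 114.0199966430664, 114.02000427246094]

exact_rows = set(tops)          # exact comparison sees two different rows
rounded_rows = group_by_row(tops)  # rounding merges them into one row
```

Comparing the raw floats yields two distinct rows, while rounding to two decimals puts all three cells under the single key 114.02. Rounding still splits values that sit on either side of a bucket boundary, so comparing against a small tolerance is a possible alternative.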

With both changes, here's the CSV: I think it looks much better. https://gist.githubusercontent.com/jeremybmerrill/d624986d48c81fde2d29/raw/06fd9284a774f1a9175a382f012ec2dbd076373b/papua.csv

jazzido commented 10 years ago

> Sometimes, there are lines that are not visible to a viewer, but are present in the PDF. That's the case here: there's a white line running across cell B6 ( 1 Jumlah pemilih terdaftar dalam Daftar Pemilih PR Tetap (DPT) )

Just for the record, tabula-java, which is soon going to become tabula-extractor's engine, includes a tool to debug these kinds of issues.

The command java -cp tabula-extractor-0.7.4-SNAPSHOT-jar-with-dependencies.jar org.nerdpower.tabula.debug.Debug --rulings -p 1 DD-1_-_DPR_-_9201_-_PAPUA_BARAT.pdf generates this output image:

[Image: dd-1_-_dpr_-_9201_-_papua_barat-1]

...which clearly shows what @jeremybmerrill described.

jeremybmerrill commented 10 years ago

Hey @jtbates, any thoughts on how to implement the solution I mention above? Would love to get this problem solved for you.

umesh-kalia commented 5 years ago

> Tabula to ignore non-black lines

Hello,

Can you please send me a solution or command for converting a PDF to CSV/Excel while ignoring non-black lines?

Thanks,