tabulapdf / tabula-java

Extract tables from PDF files
MIT License

when cell content exceeds cell boundaries, next cell gets messed up (examples) #538

Open shula opened 4 months ago

shula commented 4 months ago

When the content of 2 of the cells in the PDF continues beyond the cell's boundary, the next cell's content goes "crazy" (i.e. it is totally different from what is expected).

in the example sample:

I assume the PDF was generated from Excel, where it's common to see long text cut off at the border of the cell. I don't know for sure.

Command line used: java -Dfile.encoding=UTF8 -jar tabula-1.0.5-jar-with-dependencies.jar sample.pdf -f TSV > sample.tsv

The bogus lines are identified by / start with: 1068, 1103

Output lines with the problem:

43 E2U9 A10L YCPCT "ש""א אקליפטוס סיטריאדורה SCITRIADORA/" 1068
60 43 10 CEUCC "ש""א אקליפטוס רדיאטה LYPTUSRADIATA/" 1103

In the output, I see the following:

  1. the wrong text "A10L YCPCT" should have been "10 CC".
  2. the wrong text "E2U9" should have been "29", etc.
  3. the word "EUCALYPTUS" is cut off in these lines. This makes sense, since the cut part is not visible in the PDF, so it is not a real bug.

In the attached sample.pdf, the 3rd field of the converted text file should have been the text "10 CC".

My setup:

jeremybmerrill commented 4 months ago

Hi @shula, unfortunately this is expected behavior for a PDF with this kind of problem. The "extra"/unexpected characters (for example, AL YPT in line 1068) are present in the PDF, but they sit under the text of the next cell to the left. So Tabula is correctly extracting the characters.
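To make the effect above concrete, here is a minimal Python sketch of how overlapping text can end up interleaved. It is only an illustration, not Tabula's actual code: the function `merge_by_x`, the `space_gap` threshold, and all x coordinates are made up for this example. It assumes an extractor that orders characters by horizontal position and inserts a space wherever the gap between neighbours is large.

```python
from typing import List, Tuple

# A glyph is (x position, character). The coordinates are invented for
# illustration; they are not taken from sample.pdf.
Glyph = Tuple[float, str]


def merge_by_x(glyphs: List[Glyph], space_gap: float = 2.8) -> str:
    """Join glyphs in order of x position, adding a space at wide gaps.

    Hypothetical helper: a simplified model of position-based text
    extraction, not the algorithm Tabula or PDFBox actually uses.
    """
    out = []
    prev_x = None
    for x, ch in sorted(glyphs, key=lambda g: g[0]):
        if prev_x is not None and x - prev_x > space_gap:
            out.append(" ")
        out.append(ch)
        prev_x = x
    return "".join(out)


# Overflowing tail of the long cell ("ALYPT" from "EUCALYPTUS"), which is
# drawn in the PDF but covered by the neighbouring cell.
hidden = [(0, "A"), (3, "L"), (6, "Y"), (9, "P"), (12, "T")]

# Text of the neighbouring cell, drawn over the same horizontal range.
visible = [(1, "1"), (2, "0"), (8.5, "C"), (9.5, "C")]

print(merge_by_x(visible))            # 10 CC
merged = merge_by_x(hidden + visible)
print(merged)                         # A10L YCPCT
```

With these made-up positions, the visible cell alone reads "10 CC", but once the covered characters are included the result is the "A10L YCPCT" string from the report: both sets of characters are really in the file, so an extractor that works from positions returns all of them.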