Open shula opened 4 months ago
Hi @shula, unfortunately this is expected behavior for a PDF with this kind of problem. The "extra"/unexpected characters (for example AL YPT
in line 1068) are present in the PDF, but sit underneath the text of the next cell to the left. So Tabula is correctly extracting the characters.
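To make the overlap described above concrete, here is a minimal sketch of how two text runs that occupy the same horizontal space can come out interleaved when characters are read in order of their x position. This is a simplified model and not Tabula's actual algorithm; the helper names `place` and `extract`, the coordinates, and the overflow string "ALYPT" are assumptions chosen only for illustration.

```python
# Simplified model: each glyph is a pair (x_position, character).
# This is NOT Tabula's real extraction logic, only an illustration
# of what reading overlapping text strictly left to right produces.

def place(text, start_x, step=1.0):
    """Assign a hypothetical x coordinate to every character of `text`."""
    return [(start_x + i * step, ch) for i, ch in enumerate(text)]

def extract(glyphs):
    """Read glyphs in order of x position, ignoring which run they came from."""
    return "".join(ch for _, ch in sorted(glyphs, key=lambda g: g[0]))

# Text that is visibly drawn in the cell.
visible = place("10 CC", start_x=0.0)
# Hypothetical text spilling over from the neighbouring cell, drawn
# underneath the visible text and offset by half a character.
overflow = place("ALYPT", start_x=0.5)

merged = extract(visible + overflow)
print(merged)  # prints "1A0L YCPCT"
```

With no overflow, `extract(visible)` gives back "10 CC" unchanged; once the second run shares the same horizontal range, the characters of both runs alternate, which resembles the garbled field in the reported output.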
When 2 of the cells in the PDF continue beyond their cell boundaries, the next cell's content goes "crazy" (i.e. it is totally different from what is expected)
in the sample:
I assume the PDF was generated from Excel, where it's common to see long text cut off at the border of the cell. I don't know for sure.
Command line used:
java -Dfile.encoding=UTF8 -jar tabula-1.0.5-jar-with-dependencies.jar sample.pdf -f TSV > sample.tsv
The bogus lines are the ones that start with: 1068, 1103
Output lines with the problem:
43 E2U9 A10L YCPCT "ש""א אקליפטוס סיטריאדורה SCITRIADORA/" 1068
60 43 10 CEUCC "ש""א אקליפטוס רדיאטה LYPTUSRADIATA/" 1103
In the output, I see 2 phenomena:
In the attached sample.pdf > converted text file, the 3rd field should have been the text "10 CC".
My setup: