tabulapdf / tabula-java

Extract tables from PDF files
MIT License

unable to extract tables from PDF #343

Open himsoni-cloud opened 4 years ago

himsoni-cloud commented 4 years ago

Hi

I am getting a subprocess error when using the tabula-py library to extract tables from a PDF. I checked with the tabula-py group and they told me "this is not tabula-py's issue but tabula-java's one."

(.venv) ➜  tabula-py git:(master) ✗ java -jar tabula/tabula-1.0.3-jar-with-dependencies.jar Testing.pdf
Error: Error expected floating point number actual='-17.-21823'

Could you please take a look?

I have attached the PDF file for your reference: Testing.pdf

chezou commented 4 years ago

I investigated further, and this error comes from PDFBox, which tabula-java depends on: running PDFBox's own ExtractText tool on the same file fails with the same message. So it would be better to raise an issue with PDFBox.

java -jar pdfbox-app-2.0.18.jar ExtractText Testing.pdf
Exception in thread "main" java.io.IOException: Error expected floating point number actual='-17.-21823'
    at org.apache.pdfbox.cos.COSFloat.<init>(COSFloat.java:78)
    at org.apache.pdfbox.cos.COSNumber.get(COSNumber.java:115)
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:952)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:154)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:283)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:216)
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:867)
    at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:917)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:886)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:806)
    at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:766)
    at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:187)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1099)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1082)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1023)
    at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:218)
    at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)
    at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
Caused by: java.lang.NumberFormatException
    at java.math.BigDecimal.<init>(BigDecimal.java:494)
    at java.math.BigDecimal.<init>(BigDecimal.java:383)
    at java.math.BigDecimal.<init>(BigDecimal.java:806)
    at org.apache.pdfbox.cos.COSFloat.<init>(COSFloat.java:59)
    ... 18 more
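
One way to narrow this down without PDFBox is to search the raw file for the token quoted in the error message. The following is a minimal Python sketch, not part of tabula-java or PDFBox. The helper name `find_token_context` and the context width are illustrative assumptions, and the approach only works if the offending object is stored uncompressed rather than inside a compressed object stream.

```python
from __future__ import annotations

from pathlib import Path


def find_token_context(data: bytes, token: bytes, width: int = 40) -> list[tuple[int, bytes]]:
    """Return (offset, surrounding bytes) for every occurrence of token in data."""
    hits = []
    start = data.find(token)
    while start != -1:
        lo = max(0, start - width)
        hi = min(len(data), start + len(token) + width)
        hits.append((start, data[lo:hi]))
        start = data.find(token, start + 1)
    return hits


if __name__ == "__main__":
    # "Testing.pdf" is the file attached to this issue; adjust the path as needed.
    pdf_path = Path("Testing.pdf")
    if pdf_path.exists():
        for offset, context in find_token_context(pdf_path.read_bytes(), b"-17.-21823"):
            print(offset, context)
```

The printed context shows which dictionary key the malformed value belongs to, which is usually enough to tell whether the file itself is at fault.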

mkl-public commented 4 years ago

Your PDF contains this font descriptor object:

17 0 obj
<</Ascent 891 /CapHeight 662 /Descent -216 /Flags 32 /FontBBox
  [-497 -306 1120 1023] /FontFile2 18 0 R /FontName
  /AFPTimesNewRoman-Italic /ItalicAngle -17.-21823 /StemV 80 /Type
  /FontDescriptor /XHeight 441>>
endobj

According to the PDF specification, ItalicAngle must be a number, and -17.-21823 is not a valid number representation (it contains a second minus sign after the decimal point). PDF parsers that do not repair such errors under the hood will therefore most likely fail to read your file, as PDFBox does.
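
The rule can be checked mechanically. The Python sketch below encodes the numeric syntax from the PDF specification (an optional sign, then decimal digits with at most one period) as a regular expression and applies it to the ItalicAngle value. It also shows a possible same-length byte patch as a workaround. The function names are illustrative assumptions, not code from PDFBox or tabula-java. The intended angle is unknown, so the replacement value is only a placeholder.

```python
import re

# Numeric syntax per the PDF specification: optional sign, then digits with
# at most one period (leading, trailing, or embedded). Exponents are not allowed.
PDF_NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)")


def is_pdf_number(token: bytes) -> bool:
    """Return True if token is a syntactically valid PDF integer or real."""
    return PDF_NUMBER.fullmatch(token) is not None


def patch_same_length(data: bytes, bad: bytes, good: bytes) -> bytes:
    """Replace every occurrence of bad with good without shifting byte offsets."""
    if len(bad) != len(good):
        raise ValueError("replacement must have the same length as the original")
    if not is_pdf_number(good):
        raise ValueError("replacement is not a valid PDF number")
    return data.replace(bad, good)


descriptor = b"/FontName /AFPTimesNewRoman-Italic /ItalicAngle -17.-21823 /StemV 80"
print(is_pdf_number(b"-17.-21823"))  # False
print(is_pdf_number(b"-17.21823"))   # True
# Assumption: the real angle is unknown; -17.021823 is only a same-length placeholder.
print(patch_same_length(descriptor, b"-17.-21823", b"-17.021823"))
```

Keeping the replacement the same length matters because the cross-reference table of a PDF stores byte offsets. A patch that changes the file length could invalidate them. Applied to the full bytes of an uncompressed file, a patch like this may let a strict parser load the document, but regenerating the PDF with a corrected producer is the more reliable fix.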