Open himsoni-cloud opened 4 years ago
I investigated further, and this error comes from PDFBox, which tabula-java depends on. It would therefore be better to raise an issue with PDFBox.
java -jar pdfbox-app-2.0.18.jar ExtractText Testing.pdf
Exception in thread "main" java.io.IOException: Error expected floating point number actual='-17.-21823'
at org.apache.pdfbox.cos.COSFloat.<init>(COSFloat.java:78)
at org.apache.pdfbox.cos.COSNumber.get(COSNumber.java:115)
at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:952)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:154)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:283)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:216)
at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:867)
at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:917)
at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:886)
at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:806)
at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:766)
at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:187)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1099)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1082)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1023)
at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:218)
at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)
at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
Caused by: java.lang.NumberFormatException
at java.math.BigDecimal.<init>(BigDecimal.java:494)
at java.math.BigDecimal.<init>(BigDecimal.java:383)
at java.math.BigDecimal.<init>(BigDecimal.java:806)
at org.apache.pdfbox.cos.COSFloat.<init>(COSFloat.java:59)
... 18 more
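The `Caused by` frames in the trace show that the exception is raised by the `java.math.BigDecimal` constructor, so the failure can be reproduced without PDFBox or the PDF file. The following is a minimal sketch; the class name `ItalicAngleRepro` and the helper `parsesAsDecimal` are made up for illustration, and `-17.21823` is only an example of a well-formed token for comparison, not necessarily the value the PDF producer intended.

```java
import java.math.BigDecimal;

// Hypothetical helper class, not part of PDFBox or tabula-java.
class ItalicAngleRepro {

    /** Returns true if BigDecimal accepts the token, false if it throws. */
    static boolean parsesAsDecimal(String token) {
        try {
            new BigDecimal(token);
            return true;
        } catch (NumberFormatException e) {
            // Same exception type as the "Caused by" entry in the trace.
            return false;
        }
    }

    public static void main(String[] args) {
        // The token reported in the error message: a second sign after the period.
        System.out.println(parsesAsDecimal("-17.-21823")); // prints false
        // A well-formed token, used here only for comparison.
        System.out.println(parsesAsDecimal("-17.21823"));  // prints true
    }
}
```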
Your PDF contains this font descriptor object:
17 0 obj
<</Ascent 891 /CapHeight 662 /Descent -216 /Flags 32 /FontBBox
[-497 -306 1120 1023] /FontFile2 18 0 R /FontName
/AFPTimesNewRoman-Italic /ItalicAngle -17.-21823 /StemV 80 /Type
/FontDescriptor /XHeight 441>>
endobj
According to the PDF specification, the ItalicAngle must be a number, and -17.-21823 is not a valid number representation. PDF parsers that do not do repairs under the hood will therefore most likely fail to read your file. PDFBox does fail.
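To make the rule concrete: the PDF specification writes a number as an optional sign followed by decimal digits, with at most one period (leading, embedded, or trailing). The sketch below checks the `/ItalicAngle` token from the font descriptor against that syntax. It is an illustration only, not how PDFBox parses numbers; the class `PdfNumberCheck`, its methods, and both regular expressions are assumptions written for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical checker, written for illustration; not a PDFBox API.
class PdfNumberCheck {

    // Optional sign, then digits with at most one period.
    private static final Pattern PDF_NUMBER =
            Pattern.compile("[+-]?(\\d+\\.?\\d*|\\.\\d+)");

    // Raw token after /ItalicAngle, up to whitespace or a PDF delimiter.
    private static final Pattern ITALIC_ANGLE =
            Pattern.compile("/ItalicAngle\\s+([^\\s/<>\\[\\]()]+)");

    static boolean isValidPdfNumber(String token) {
        return PDF_NUMBER.matcher(token).matches();
    }

    /** Returns the raw /ItalicAngle token, or null if the key is absent. */
    static String italicAngleToken(String dictionary) {
        Matcher m = ITALIC_ANGLE.matcher(dictionary);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The font descriptor dictionary quoted in this thread.
        String descriptor = "<</Ascent 891 /CapHeight 662 /Descent -216 /Flags 32 /FontBBox "
                + "[-497 -306 1120 1023] /FontFile2 18 0 R /FontName "
                + "/AFPTimesNewRoman-Italic /ItalicAngle -17.-21823 /StemV 80 /Type "
                + "/FontDescriptor /XHeight 441>>";

        String token = italicAngleToken(descriptor);
        System.out.println(token);                    // prints -17.-21823
        System.out.println(isValidPdfNumber(token));  // prints false
        System.out.println(isValidPdfNumber("-216")); // prints true
    }
}
```

A check like this only finds the problem in uncompressed objects such as `17 0 obj` here; dictionaries stored inside compressed object streams would need to be decoded first.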
Hi,
I am getting a subprocess error while using the tabula-py library to extract tables from a PDF. I checked with the tabula-py group and they told me "this is not tabula-py's issue but tabula-java's one."
Could you please take a look?
The PDF file is attached for your reference: Testing.pdf