plutext / docx4j

JAXB-based Java library for Word docx, Powerpoint pptx, and Excel xlsx files
https://www.docx4java.org/

NumberFormatException when extracting text from docx file #148

Closed akostajti closed 5 years ago

akostajti commented 9 years ago

I'm extracting text from a docx file using TextUtils.extractText(Object o, Writer w). For a certain document (generated with an older version of Google Docs) I get this exception:

2015-06-21 05:55:14,999 ERROR openpackaging.parts.JaxbXmlPartXPathAware - For input string: "9360.0" [DefaultQuartzScheduler_Worker-10] {}
java.lang.NumberFormatException: For input string: "9360.0"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:492)
    at java.math.BigInteger.<init>(BigInteger.java:338)
    at java.math.BigInteger.<init>(BigInteger.java:476)
    at com.sun.xml.internal.bind.DatatypeConverterImpl._parseInteger(DatatypeConverterImpl.java:72)
    at com.sun.xml.internal.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$21.parse(RuntimeBuiltinLeafInfoImpl.java:766)
    at com.sun.xml.internal.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$21.parse(RuntimeBuiltinLeafInfoImpl.java:764)
    at com.sun.xml.internal.bind.v2.runtime.reflect.TransducedAccessor$CompositeTransducedAccessorImpl.parse(TransducedAccessor.java:230)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.StructureLoader.startElement(StructureLoader.java:194)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallingContext._startElement(UnmarshallingContext.java:486)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallingContext.startElement(UnmarshallingContext.java:465)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.InterningXmlVisitor.startElement(InterningXmlVisitor.java:60)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.SAXConnector.startElement(SAXConnector.java:135)
    at com.sun.xml.internal.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:229)
    at com.sun.xml.internal.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:266)
    at com.sun.xml.internal.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:235)
    at com.sun.xml.internal.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:266)
    at com.sun.xml.internal.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:235)
    at com.sun.xml.internal.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:266)
    at com.sun.xml.internal.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:235)
    at com.sun.xml.internal.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:266)
    at com.sun.xml.internal.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:235)
    at com.sun.xml.internal.bind.unmarshaller.DOMScanner.scan(DOMScanner.java:112)
    at com.sun.xml.internal.bind.unmarshaller.DOMScanner.scan(DOMScanner.java:95)
    at com.sun.xml.internal.bind.unmarshaller.DOMScanner.scan(DOMScanner.java:88)
    at com.sun.xml.internal.bind.v2.runtime.BinderImpl.associativeUnmarshal(BinderImpl.java:146)
    at com.sun.xml.internal.bind.v2.runtime.BinderImpl.unmarshal(BinderImpl.java:117)
    at org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware.unwrapUsually(JaxbXmlPartXPathAware.java:283)
    at org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware.unmarshal(JaxbXmlPartXPathAware.java:333)
    at org.docx4j.openpackaging.parts.JaxbXmlPart.getContents(JaxbXmlPart.java:147)

Is there a way to prevent this exception?

plutext commented 9 years ago

Please put the docx somewhere I can look at it.

akostajti commented 9 years ago

Sorry, I forgot it. You can download the file here: https://drive.google.com/file/d/0B6qA3QZEFwTKaXdlNE9PRGJhRVU/view?usp=sharing

lukateras commented 7 years ago

@plutext Any updates? I had a very similar issue with the latest version of docx4j:

Unhandled java.lang.NumberFormatException
   For input string: "9576.0"

NumberFormatException.java:   65  java.lang.NumberFormatException/forInputString
              Integer.java:  580  java.lang.Integer/parseInt
           BigInteger.java:  470  java.math.BigInteger/<init>
           BigInteger.java:  606  java.math.BigInteger/<init>
DatatypeConverterImpl.java:   76  com.sun.xml.internal.bind.DatatypeConverterImpl/_parseInteger
RuntimeBuiltinLeafInfoImpl.java:  779  com.sun.xml.internal.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$22/parse
RuntimeBuiltinLeafInfoImpl.java:  777  com.sun.xml.internal.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$22/parse
   TransducedAccessor.java:  230  com.sun.xml.internal.bind.v2.runtime.reflect.TransducedAccessor$CompositeTransducedAccessorImpl/parse
      StructureLoader.java:  195  com.sun.xml.internal.bind.v2.runtime.unmarshaller.StructureLoader/startElement
 UnmarshallingContext.java:  559  com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallingContext/_startElement
 UnmarshallingContext.java:  538  com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallingContext/startElement
         SAXConnector.java:  153  com.sun.xml.internal.bind.v2.runtime.unmarshaller.SAXConnector/startElement
    AbstractSAXParser.java:  509  com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser/startElement
AbstractXMLDocumentParser.java:  182  com.sun.org.apache.xerces.internal.parsers.AbstractXMLDocumentParser/emptyElement
XMLNSDocumentScannerImpl.java:  351  com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl/scanStartElement
XMLDocumentFragmentScannerImpl.java: 2784  com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver/next
XMLDocumentScannerImpl.java:  602  com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl/next
XMLNSDocumentScannerImpl.java:  112  com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl/next
XMLDocumentFragmentScannerImpl.java:  505  com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl/scanDocument
   XML11Configuration.java:  841  com.sun.org.apache.xerces.internal.parsers.XML11Configuration/parse
   XML11Configuration.java:  770  com.sun.org.apache.xerces.internal.parsers.XML11Configuration/parse
            XMLParser.java:  141  com.sun.org.apache.xerces.internal.parsers.XMLParser/parse
    AbstractSAXParser.java: 1213  com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser/parse
        SAXParserImpl.java:  643  com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser/parse
     UnmarshallerImpl.java:  243  com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl/unmarshal0
     UnmarshallerImpl.java:  214  com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl/unmarshal
AbstractUnmarshallerImpl.java:  157  javax.xml.bind.helpers.AbstractUnmarshallerImpl/unmarshal
AbstractUnmarshallerImpl.java:  125  javax.xml.bind.helpers.AbstractUnmarshallerImpl/unmarshal
             XmlUtils.java:  540  org.docx4j.XmlUtils/unmarshalString
             XmlUtils.java:  589  org.docx4j.XmlUtils/unmarshallFromTemplate
          JaxbXmlPart.java:  266  org.docx4j.openpackaging.parts.JaxbXmlPart/variableReplace
NativeMethodAccessorImpl.java:   -2  sun.reflect.NativeMethodAccessorImpl/invoke0
NativeMethodAccessorImpl.java:   62  sun.reflect.NativeMethodAccessorImpl/invoke
DelegatingMethodAccessorImpl.java:   43  sun.reflect.DelegatingMethodAccessorImpl/invoke
               Method.java:  498  java.lang.reflect.Method/invoke
            Reflector.java:   93  clojure.lang.Reflector/invokeMatchingMethod
            Reflector.java:   28  clojure.lang.Reflector/invokeInstanceMethod
            ...

plutext commented 7 years ago

Please post your docx at http://ndoc.it

Which version of docx4j?

Generally such issues are handled by the code at https://github.com/plutext/docx4j/blob/master/src/main/java/org/docx4j/jaxb/mc-preprocessor.xslt#L89

tingley commented 5 years ago

Another example attached.

border.docx.zip

In this case, it's triggered by the decimal value of 1.8 in w:space:

        <w:pBdr>
          <w:top w:sz="7" w:space="1.8" w:color="#333437" w:val="single"/>
          <w:left w:sz="7" w:space="0" w:color="#000000" w:val="single"/>
          <w:bottom w:sz="3" w:space="7.2" w:color="#323539" w:val="double"/>
          <w:right w:sz="7" w:space="0" w:color="#000000" w:val="single"/>
        </w:pBdr>

According to the schema, w:space should be of type ST_PointMeasure, and docx4j parses it as a BigInteger, so this document may not actually be schema-valid. However, other tools open it fine (LibreOffice Writer silently corrects the value; I haven't tested in Word). I do not know what tool generated this document.
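To make the failure mode concrete, here is a minimal standalone sketch (not docx4j code, and not the fix that was eventually committed). It shows that a strict `BigInteger` parse rejects a decimal string such as "1.8", while a lenient parse through `BigDecimal` can round it to a whole number. The class name `PointMeasureDemo` and the helpers `parseStrict` and `parseLenient` are hypothetical, and rounding half up is an assumption about what "correcting" the value might mean.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

public class PointMeasureDemo {

    /** Strict parse: an integer-typed binding ends up calling new BigInteger(String). */
    static BigInteger parseStrict(String raw) {
        // Throws NumberFormatException for inputs such as "1.8" or "9360.0".
        return new BigInteger(raw);
    }

    /** Lenient parse (hypothetical helper): accept a decimal and round half up. */
    static BigInteger parseLenient(String raw) {
        return new BigDecimal(raw.trim())
                .setScale(0, RoundingMode.HALF_UP)
                .toBigIntegerExact();
    }

    public static void main(String[] args) {
        // Values taken from the reports in this thread.
        for (String raw : new String[] {"0", "1.8", "7.2", "9360.0"}) {
            try {
                System.out.println(raw + " strict  -> " + parseStrict(raw));
            } catch (NumberFormatException e) {
                System.out.println(raw + " strict  -> NumberFormatException");
            }
            System.out.println(raw + " lenient -> " + parseLenient(raw));
        }
    }
}
```

Whether rounding is acceptable depends on the attribute; for a border spacing measured in points, losing a fraction of a point is unlikely to matter for text extraction.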

Stack trace follows.

Caused by: java.lang.NumberFormatException: For input string: "1.8"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_212]
    at java.lang.Integer.parseInt(Integer.java:580) ~[?:1.8.0_212]
    at java.math.BigInteger.<init>(BigInteger.java:470) ~[?:1.8.0_212]
    at java.math.BigInteger.<init>(BigInteger.java:606) ~[?:1.8.0_212]
    at com.sun.xml.bind.DatatypeConverterImpl._parseInteger(DatatypeConverterImpl.java:91) ~[jaxb-runtime-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$22.parse(RuntimeBuiltinLeafInfoImpl.java:800) ~[jaxb-runtime-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$22.parse(RuntimeBuiltinLeafInfoImpl.java:798) ~[jaxb-runtime-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.v2.runtime.reflect.TransducedAccessor$CompositeTransducedAccessorImpl.parse(TransducedAccessor.java:245) ~[jaxb-runtime-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.v2.runtime.unmarshaller.StructureLoader.startElement(StructureLoader.java:212) ~[jaxb-runtime-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallingContext._startElement(UnmarshallingContext.java:577) ~[jaxb-runtime-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallingContext.startElement(UnmarshallingContext.java:556) ~[jaxb-runtime-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.v2.runtime.unmarshaller.InterningXmlVisitor.startElement(InterningXmlVisitor.java:75) ~[jaxb-runtime-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.v2.runtime.unmarshaller.SAXConnector.startElement(SAXConnector.java:168) ~[jaxb-runtime-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:244) ~[jaxb-core-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:281) ~[jaxb-core-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:250) ~[jaxb-core-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:281) ~[jaxb-core-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:250) ~[jaxb-core-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:281) ~[jaxb-core-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:250) ~[jaxb-core-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:281) ~[jaxb-core-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:250) ~[jaxb-core-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:281) ~[jaxb-core-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:250) ~[jaxb-core-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.unmarshaller.DOMScanner.scan(DOMScanner.java:127) ~[jaxb-core-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.unmarshaller.DOMScanner.scan(DOMScanner.java:110) ~[jaxb-core-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.unmarshaller.DOMScanner.scan(DOMScanner.java:103) ~[jaxb-core-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.v2.runtime.BinderImpl.associativeUnmarshal(BinderImpl.java:161) ~[jaxb-runtime-2.3.0.jar:2.3.0]
    at com.sun.xml.bind.v2.runtime.BinderImpl.unmarshal(BinderImpl.java:132) ~[jaxb-runtime-2.3.0.jar:2.3.0]
    at org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware.unmarshal(JaxbXmlPartXPathAware.java:574) ~[docx4j-6.0.1.jar:?]
    at org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware.unmarshal(JaxbXmlPartXPathAware.java:355) ~[docx4j-6.0.1.jar:?]
    at org.docx4j.openpackaging.parts.JaxbXmlPart.getContents(JaxbXmlPart.java:194) ~[docx4j-6.0.1.jar:?]
    ... 27 more

plutext commented 5 years ago

Should be fixed by https://github.com/plutext/docx4j/commit/bc652c5bf945a8c62b18d1f02f16d3571d0ba677

Will be in a new release this week.

Anybody else who encounters a similar issue but on some other attribute, please open your own issue, clearly showing what XML structure is at issue.