plutext / docx4j

JAXB-based Java library for Word docx, Powerpoint pptx, and Excel xlsx files
https://www.docx4java.org/

Incorrect processing of [Content_Types].xml with Default tags #46

Closed by markkimsal 11 years ago

markkimsal commented 11 years ago

If you replace the Override content type tag for the part word/document.xml with a Default content type tag, docx4j fails to load the file and complains that it can only handle docx files. Apache POI writes docx files this way, with a Default tag instead of a specific Override tag, so it is currently impossible to chain POI to docx4j.

Replace:

<Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>

with:

<Default Extension="xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>

Exception:

org.docx4j.openpackaging.exceptions.InvalidFormatException: Unexpected package (docx4j supports docx/docxm and pptx only
    at org.docx4j.openpackaging.contenttype.ContentTypeManager.createPackage(ContentTypeManager.java:834)
    at org.docx4j.openpackaging.io.LoadFromZipNG.process(LoadFromZipNG.java:213)
    at org.docx4j.openpackaging.io.LoadFromZipNG.get(LoadFromZipNG.java:193)
    at org.docx4j.openpackaging.packages.OpcPackage.load(OpcPackage.java:301)
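
For context, here is a minimal sketch of how a part's content type is normally resolved from [Content_Types].xml: an Override for the exact part name wins, otherwise the Default for the part's extension applies. `ContentTypeLookup`, `parse` and `resolve` are hypothetical names for illustration, not docx4j or POI API, and the sketch uses only the JDK's XML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not docx4j code): resolves a part's content type. */
class ContentTypeLookup {
    // Default entries: lower-cased extension -> content type
    private final Map<String, String> defaults = new HashMap<>();
    // Override entries: lower-cased part name -> content type
    private final Map<String, String> overrides = new HashMap<>();

    /** Parses the text of a [Content_Types].xml part. */
    static ContentTypeLookup parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            ContentTypeLookup lookup = new ContentTypeLookup();
            NodeList defaultNodes = doc.getElementsByTagNameNS("*", "Default");
            for (int i = 0; i < defaultNodes.getLength(); i++) {
                Element e = (Element) defaultNodes.item(i);
                lookup.defaults.put(e.getAttribute("Extension").toLowerCase(Locale.ROOT),
                        e.getAttribute("ContentType"));
            }
            NodeList overrideNodes = doc.getElementsByTagNameNS("*", "Override");
            for (int i = 0; i < overrideNodes.getLength(); i++) {
                Element e = (Element) overrideNodes.item(i);
                lookup.overrides.put(e.getAttribute("PartName").toLowerCase(Locale.ROOT),
                        e.getAttribute("ContentType"));
            }
            return lookup;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse content types", ex);
        }
    }

    /** Returns the content type for a part name, or null if no entry applies. */
    String resolve(String partName) {
        String key = partName.toLowerCase(Locale.ROOT);
        String override = overrides.get(key);
        if (override != null) {
            return override;                      // exact part name wins
        }
        int slash = key.lastIndexOf('/');
        int dot = key.lastIndexOf('.');
        if (dot > slash) {
            return defaults.get(key.substring(dot + 1)); // fall back to the extension
        }
        return null;
    }

    public static void main(String[] args) {
        // Same shape as the failing file: only a Default entry for "xml".
        String xml = "<Types xmlns=\"http://schemas.openxmlformats.org/package/2006/content-types\">"
                + "<Default Extension=\"xml\" ContentType=\"application/vnd.openxmlformats-"
                + "officedocument.wordprocessingml.document.main+xml\"/>"
                + "</Types>";
        System.out.println(parse(xml).resolve("/word/document.xml"));
    }
}
```

With a lookup like this, word/document.xml still resolves to the WordprocessingML main content type even when no Override is present, which is the case a loader would need to handle.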

markkimsal commented 11 years ago

Something like this works for me when I patch ContentTypeManager.createPackage, but I know it is not complete:

    if (getPartNameOverridenByContentType(ContentTypes.WORDPROCESSINGML_DOCUMENT) != null
            || getPartNameOverridenByContentType(ContentTypes.WORDPROCESSINGML_DOCUMENT_MACROENABLED) != null
            || getPartNameOverridenByContentType(ContentTypes.WORDPROCESSINGML_TEMPLATE ) != null
            || getPartNameOverridenByContentType(ContentTypes.WORDPROCESSINGML_TEMPLATE_MACROENABLED) != null ) {
        log.info("Detected WordProcessingML package ");
        p = new WordprocessingMLPackage(this);
        return p;
    } else if (getPartNameOverridenByContentType(ContentTypes.PRESENTATIONML_MAIN) != null
            || getPartNameOverridenByContentType(ContentTypes.PRESENTATIONML_TEMPLATE) != null
            || getPartNameOverridenByContentType(ContentTypes.PRESENTATIONML_SLIDESHOW) != null) {
        log.info("Detected PresentationMLPackage package ");
        p = new PresentationMLPackage(this);
        return p;
    } else if (getPartNameOverridenByContentType(ContentTypes.SPREADSHEETML_WORKBOOK) != null
            || getPartNameOverridenByContentType(ContentTypes.SPREADSHEETML_WORKBOOK_MACROENABLED) != null
            || getPartNameOverridenByContentType(ContentTypes.SPREADSHEETML_TEMPLATE) != null
            || getPartNameOverridenByContentType(ContentTypes.SPREADSHEETML_TEMPLATE_MACROENABLED) != null) {
        //  "xlam", "xlsb" ?
        log.info("Detected SpreadsheetMLPackage package ");
        p = new SpreadsheetMLPackage(this);
        return p;
    } else if (getPartNameOverridenByContentType(ContentTypes.DRAWINGML_DIAGRAM_LAYOUT) != null) {
        log.info("Detected Glox file ");
        p = new GloxPackage(this);
        return p;
    } else {

        // No Override matched. Fall back to the content type that applies to the
        // main document part, which may come from a Default entry for "xml".
        String defaultContentType = getContentType(new PartName("/word/document.xml", true));
        if (ContentTypes.WORDPROCESSINGML_DOCUMENT.equals(defaultContentType)) {
            log.info("Detected WordProcessingML package ");
            p = new WordprocessingMLPackage(this);
            return p;
        }

        throw new InvalidFormatException("Unexpected package (docx4j supports docx/docxm and pptx only)");
      //log.warn("No part in [Content_Types].xml for content type"
      //        + ContentTypes.WORDPROCESSINGML_DOCUMENT);
      // TODO - what content type in this case?
      //return new Package(this);
    }
}
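
The fallback above only covers Word documents. A rough sketch of how the same idea could be generalised: ask for the content type of each format's conventional main part and compare it with the expected main content type. `PackageKindDetector`, `Kind` and the `contentTypeOf` resolver are hypothetical names, and the fixed part paths are an assumption (strictly, the main part is located through the package relationships), so this is illustrative only and not what docx4j does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch (not docx4j code): guess the package kind from its main part. */
final class PackageKindDetector {
    enum Kind { WORDPROCESSING, PRESENTATION, SPREADSHEET, UNKNOWN }

    // Assumed conventional main part path, expected content type, resulting kind.
    private static final String[][] CANDIDATES = {
        {"/word/document.xml",
         "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml",
         "WORDPROCESSING"},
        {"/ppt/presentation.xml",
         "application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml",
         "PRESENTATION"},
        {"/xl/workbook.xml",
         "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml",
         "SPREADSHEET"},
    };

    /**
     * @param contentTypeOf resolves a part name to its content type (Override first,
     *                      then Default by extension), or null if nothing applies
     */
    static Kind detect(Function<String, String> contentTypeOf) {
        for (String[] candidate : CANDIDATES) {
            if (candidate[1].equals(contentTypeOf.apply(candidate[0]))) {
                return Kind.valueOf(candidate[2]);
            }
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        // A package that declares its main part only through a Default for "xml".
        Map<String, String> defaults = new HashMap<>();
        defaults.put("xml", CANDIDATES[0][1]);
        Function<String, String> byExtension =
                part -> defaults.get(part.substring(part.lastIndexOf('.') + 1));
        System.out.println(detect(byExtension));
    }
}
```

Because the check goes through a resolver instead of scanning Override entries, it gives the same answer whether the type was declared with an Override or with a Default.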

markkimsal commented 11 years ago

This also seems to work, and might be more complete:

/* Given a content type, return the Part Name URI it is
 * overridden by. If no Override matches, the Default entries
 * are checked as well.
 */ 
public URI getPartNameOverridenByContentType(String contentType) { 

    // hmm, can there only be one instance of a given
    // content type?

    log.debug("getPartNameOverridenByContentType: " + contentType);
    Iterator i = overrideContentType.entrySet().iterator();
    while (i.hasNext()) { 
        Map.Entry e = (Map.Entry)i.next();
        if (e != null) { 
            log.debug("Inspecting " + ((CTOverride) e.getValue()).getContentType());
            if ( ((CTOverride)e.getValue()).getContentType().equals(contentType) ) { 
                log.debug("Matched!");
                return (URI)e.getKey(); 
            } 
        } 
    }      
    i = defaultContentType.entrySet().iterator();
    while (i.hasNext()) { 
        Map.Entry e = (Map.Entry)i.next();
        if (e != null) { 
            log.debug("Inspecting " + ((CTDefault) e.getValue()).getContentType());
            if ( ((CTDefault)e.getValue()).getContentType().equals(contentType) ) { 
                log.debug("Matched!");
                try { 
                    return new URI((String)e.getKey());
                } catch (java.net.URISyntaxException ex) { 
                    log.debug("URI Syntax exception: "+ e.getKey());
                    //continue;
                } 
            } 
        } 
    }      

    return null;

} 
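
One caveat with the version above: for a Default match the returned URI is built from the map key, which is presumably the file extension and not a part name. A small typed sketch that keeps the two cases apart could look like this. `ReverseContentTypeLookup`, `Match` and `find` are hypothetical names, and plain string maps stand in for docx4j's internal maps.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch (not docx4j code): reverse lookup by content type. */
final class ReverseContentTypeLookup {

    /** Says where a content type was declared and under which key. */
    static final class Match {
        final boolean fromOverride; // true: key is a part name; false: key is an extension
        final String key;

        Match(boolean fromOverride, String key) {
            this.fromOverride = fromOverride;
            this.key = key;
        }

        @Override
        public String toString() {
            return (fromOverride ? "Override " : "Default ") + key;
        }
    }

    /** Searches the Override entries first, then the Default entries. */
    static Optional<Match> find(Map<String, String> overrides,
                                Map<String, String> defaults,
                                String contentType) {
        for (Map.Entry<String, String> e : overrides.entrySet()) {
            if (contentType.equals(e.getValue())) {
                return Optional.of(new Match(true, e.getKey()));
            }
        }
        for (Map.Entry<String, String> e : defaults.entrySet()) {
            if (contentType.equals(e.getValue())) {
                return Optional.of(new Match(false, e.getKey()));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> overrides = new LinkedHashMap<>();
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("xml",
                "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml");
        System.out.println(find(overrides, defaults,
                "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"));
    }
}
```

A caller that only needs to know whether the type is declared can test `isPresent()`, while code that needs the actual part name can tell from `fromOverride` that a Default match does not supply one.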

plutext commented 11 years ago

See also discussion at http://stackoverflow.com/questions/15007550/dox4j-cannot-read-poi-saved-files-who-is-at-fault

Fixed in https://github.com/plutext/docx4j/commit/1c1190fc3a2fc6e191c825a0e30fde2654cc997c