DocxDataInspector.getAllElements(Object, Class<T>) Ignores SdtElements #60

eum2o commented 2 years ago

Tested with docwriter 2.1.0

Description: DocxDataInspector.getAllElements(Object, Class<T>) does not find Text elements inside SdtElements (which are used, for example, for TOCs). SdtElements are not of type ContentAccessor, so their children are never examined.

See DocxDataInspector.java#L47

Given: A document can have content (in the body of the main document) like this:

<w:sdt>
    <w:sdtPr>
        <w:docPartObj>
            <w:docPartGallery w:val="Table of Contents"/>
            <w:docPartUnique/>
        </w:docPartObj>
        <w:id w:val="1078003780"/>
    </w:sdtPr>
    <w:sdtContent>
        <w:p>
            <w:pPr>
                <w:pStyle w:val="Inhaltsverzeichnisberschrift"/>
            </w:pPr>
            <w:r>
                <w:t>Table of Contents</w:t>
            </w:r>
        </w:p>
        <w:p>
            <w:r>
                <w:fldChar w:fldCharType="begin"/>
            </w:r>
            <w:r>
                <w:instrText xml:space="preserve">TOC \o &quot;1-3&quot; \n 1-3 \h \z \u</w:instrText>
            </w:r>
            <w:r>
                <w:fldChar w:fldCharType="separate"/>
            </w:r>
        </w:p>
        <w:p>
            <w:r>
                <w:fldChar w:fldCharType="end"/>
            </w:r>
        </w:p>
    </w:sdtContent>
</w:sdt>    
<w:p w14:paraId="5F641DF7" w14:textId="79F4A31F" w:rsidR="008143B7" w:rsidRDefault="000516EE" w:rsidP="000516EE">           
    <w:r w:rsidR="008143B7" w:rsidRPr="008143B7">
        <w:rPr>
            <w:lang w:val="en-US"/>
        </w:rPr>
        <w:t>Change History</w:t>
    </w:r>          
</w:p>

Example Call

WordprocessingMLPackage doc = parseDocument("path...");
MainDocumentPart mainPart = doc.getMainDocumentPart();
List<Text> texts = new DocxDataInspector().getAllElements(mainPart, Text.class);

Problem/Bug

Expected: texts contains "Change History" and "Table of Contents"
Actual: texts contains "Change History"
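
The gap between Expected and Actual can be reproduced without docx4j. The following is a minimal sketch under stated assumptions: the types ContentAccessor, Text, P and Sdt are hypothetical stand-ins written for this example (they are not the docx4j classes), runs are omitted for brevity, and collectTexts is an illustrative helper, not the code in DocxDataInspector. It only shows the mechanism: a traversal that descends solely through ContentAccessor never looks inside a node that keeps its children elsewhere.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-in model (hypothetical, NOT docx4j) of a traversal that skips SDT content. */
public class SdtTraversalSketch {

    /** Stand-in for a node type that exposes its children via getContent(). */
    public interface ContentAccessor {
        List<Object> getContent();
    }

    /** Stand-in for a text element. */
    public static final class Text {
        final String value;
        Text(String value) { this.value = value; }
    }

    /** Stand-in for a paragraph: it IS a ContentAccessor. */
    public static final class P implements ContentAccessor {
        private final List<Object> content;
        P(Object... children) { this.content = new ArrayList<>(Arrays.asList(children)); }
        @Override public List<Object> getContent() { return content; }
    }

    /** Stand-in for an SDT: it has children but is NOT a ContentAccessor. */
    public static final class Sdt {
        final List<Object> sdtContent;
        Sdt(Object... children) { this.sdtContent = new ArrayList<>(Arrays.asList(children)); }
    }

    /**
     * Collects text values in document order.
     * With descendIntoSdt == false, only ContentAccessor nodes are descended into,
     * so anything inside an Sdt is skipped.
     */
    public static List<String> collectTexts(List<Object> nodes, boolean descendIntoSdt) {
        List<String> out = new ArrayList<>();
        for (Object node : nodes) {
            if (node instanceof Text) {
                out.add(((Text) node).value);
            } else if (node instanceof ContentAccessor) {
                out.addAll(collectTexts(((ContentAccessor) node).getContent(), descendIntoSdt));
            } else if (descendIntoSdt && node instanceof Sdt) {
                out.addAll(collectTexts(((Sdt) node).sdtContent, descendIntoSdt));
            }
        }
        return out;
    }

    /** Mirrors the shape of the XML above: an SDT holding the TOC heading, then a plain paragraph. */
    public static List<Object> sampleBody() {
        List<Object> body = new ArrayList<>();
        body.add(new Sdt(new P(new Text("Table of Contents"))));
        body.add(new P(new Text("Change History")));
        return body;
    }

    public static void main(String[] args) {
        System.out.println(collectTexts(sampleBody(), false)); // [Change History]
        System.out.println(collectTexts(sampleBody(), true));  // [Table of Contents, Change History]
    }
}
```

In this model the first call matches the "Actual" line and the second matches the "Expected" line (in document order).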

Hint: Consider reusing org.docx4j.TraversalUtil to avoid duplicating the traversal logic. As a workaround, I used something like this:

// Workaround: let docx4j's TraversalUtil walk the body and collect every Text value.
final List<String> texts = new ArrayList<>();
final WordprocessingMLPackage document = parseDocument("path...");
final Body body = document.getMainDocumentPart().getJaxbElement().getBody();
TraversalUtil.visit(body, new org.docx4j.TraversalUtil.CallbackImpl() {

    @Override
    public List<Object> apply(Object pObj) {
        // Called for each visited node; keep only the Text values.
        if (pObj instanceof Text) {
            final String text = ((Text) pObj).getValue();
            texts.add(text);
        }
        return null;
    }
});

ingomohr commented 2 years ago

Thanks for the feedback, @eum2o. Will look into this.

ingomohr commented 2 years ago

Change

Refactored to use TraversalUtil and ClassFinder instead.

Note: The implementation used now doesn't support finding SdtElement objects themselves, though. It recognizes them and then returns the contents found in them; see org.docx4j.TraversalUtil.getChildrenImpl(Object). So it will still find Text elements inside them, as requested for this issue.

If we need to find the SdtElement nodes themselves, too, we should maybe discuss contributing to Docx4J.
