ingomohr / docwriter

API to write docx documents
MIT License

DocxDataInspector.getAllElements(Object, Class<T>) Ignores SdtElements #60

Closed eum2o closed 2 years ago

eum2o commented 2 years ago

Tested with docwriter 2.1.0

Description

The method named in the title does not find Text elements inside SdtElements (which are used e.g. for TOCs). SdtElements are not of type ContentAccessor, so their children are never examined.

See DocxDataInspector.java#L47
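
To illustrate the description, here is a minimal, self-contained sketch of the failure mode. It does not use docx4j and is not docwriter's actual code: `Container`, `Sdt`, `Text`, the local `ContentAccessor` interface, and the `collect` method are hypothetical stand-ins that only mimic the shape of the real types. The flag `descendIntoSdt` switches between a traversal that only follows ContentAccessor nodes and one that also looks into the content of an Sdt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SdtTraversalSketch {

    // Hypothetical stand-in for the docx4j interface of the same name.
    interface ContentAccessor {
        List<Object> getContent();
    }

    // Stand-in for any node with children (body, paragraph, run, sdtContent).
    static class Container implements ContentAccessor {
        private final List<Object> content;
        Container(Object... children) { this.content = Arrays.asList(children); }
        @Override public List<Object> getContent() { return content; }
    }

    // Stand-in for an SdtElement: it is NOT a ContentAccessor itself,
    // its children are only reachable through getSdtContent().
    static class Sdt {
        private final Container sdtContent;
        Sdt(Container sdtContent) { this.sdtContent = sdtContent; }
        Container getSdtContent() { return sdtContent; }
    }

    // Stand-in for a text node.
    static class Text {
        private final String value;
        Text(String value) { this.value = value; }
        String getValue() { return value; }
    }

    // Collects all nodes of the given type below (and including) the given node.
    static <T> List<T> collect(Object node, Class<T> type, boolean descendIntoSdt) {
        List<T> found = new ArrayList<>();
        walk(node, type, descendIntoSdt, found);
        return found;
    }

    private static <T> void walk(Object node, Class<T> type, boolean descendIntoSdt, List<T> found) {
        if (type.isInstance(node)) {
            found.add(type.cast(node));
        }
        if (node instanceof ContentAccessor) {
            for (Object child : ((ContentAccessor) node).getContent()) {
                walk(child, type, descendIntoSdt, found);
            }
        } else if (descendIntoSdt && node instanceof Sdt) {
            // The extra branch: without it, everything inside the Sdt is skipped.
            walk(((Sdt) node).getSdtContent(), type, descendIntoSdt, found);
        }
    }

    // Body with an Sdt holding one paragraph, followed by a regular paragraph.
    static Container sampleBody() {
        Container tocParagraph = new Container(new Container(new Text("Table of Contents")));
        Container plainParagraph = new Container(new Container(new Text("Change History")));
        return new Container(new Sdt(new Container(tocParagraph)), plainParagraph);
    }

    static List<String> values(List<Text> texts) {
        List<String> result = new ArrayList<>();
        for (Text text : texts) {
            result.add(text.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Container body = sampleBody();
        // ContentAccessor-only traversal: prints [Change History]
        System.out.println(values(collect(body, Text.class, false)));
        // Traversal that also enters the Sdt: prints [Table of Contents, Change History]
        System.out.println(values(collect(body, Text.class, true)));
    }
}
```

With the ContentAccessor-only traversal, the text inside the Sdt is silently dropped, which matches the behavior reported here.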

Given

A document can have content (in the body of the main document) like this:

<w:sdt>
    <w:sdtPr>
        <w:docPartObj>
            <w:docPartGallery w:val="Table of Contents"/>
            <w:docPartUnique/>
        </w:docPartObj>
        <w:id w:val="1078003780"/>
    </w:sdtPr>
    <w:sdtContent>
        <w:p>
            <w:pPr>
                <w:pStyle w:val="Inhaltsverzeichnisberschrift"/>
            </w:pPr>
            <w:r>
                <w:t>Table of Contents</w:t>
            </w:r>
        </w:p>
        <w:p>
            <w:r>
                <w:fldChar w:fldCharType="begin"/>
            </w:r>
            <w:r>
                <w:instrText xml:space="preserve">TOC \o &quot;1-3&quot; \n 1-3 \h \z \u</w:instrText>
            </w:r>
            <w:r>
                <w:fldChar w:fldCharType="separate"/>
            </w:r>
        </w:p>
        <w:p>
            <w:r>
                <w:fldChar w:fldCharType="end"/>
            </w:r>
        </w:p>
    </w:sdtContent>
</w:sdt>    
<w:p w14:paraId="5F641DF7" w14:textId="79F4A31F" w:rsidR="008143B7" w:rsidRDefault="000516EE" w:rsidP="000516EE">           
    <w:r w:rsidR="008143B7" w:rsidRPr="008143B7">
        <w:rPr>
            <w:lang w:val="en-US"/>
        </w:rPr>
        <w:t>Change History</w:t>
    </w:r>          
</w:p>

Example Call

WordprocessingMLPackage doc = parseDocument("path...");
MainDocumentPart mainPart = doc.getMainDocumentPart();
List<Text> texts = new DocxDataInspector().getAllElements(mainPart, Text.class);

Problem/Bug

Hint

Consider reusing org.docx4j.TraversalUtil to avoid duplicating the traversal logic. For example, as a workaround I used something like this:

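import java.util.ArrayList;
import java.util.List;

import org.docx4j.TraversalUtil;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.wml.Body;
import org.docx4j.wml.Text;

// parseDocument(...) is a helper that is not shown in this issue.
final List<String> texts = new ArrayList<>();
final WordprocessingMLPackage document = parseDocument("path...");
final Body body = document.getMainDocumentPart().getJaxbElement().getBody();
TraversalUtil.visit(body, new org.docx4j.TraversalUtil.CallbackImpl() {

    @Override
    public List<Object> apply(Object pObj) {
        // Collect the value of every Text node the traversal reaches.
        if (pObj instanceof Text) {
            final String text = ((Text) pObj).getValue();
            texts.add(text);
        }
        return null;
    }
});

ingomohr commented 2 years ago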

Thanks for the feedback, @eum2o. Will look into this.

ingomohr commented 2 years ago

Change

Note: The implementation used now doesn't support finding SdtElement objects themselves, though. It recognizes them and then returns the contents found in them (see org.docx4j.TraversalUtil.getChildrenImpl(Object)). So it will still find Text elements inside them, as requested in this issue.

If we need to find those direct nodes too, we should perhaps discuss contributing to docx4j.