radkovo / CSSBox

CSSBox is an (X)HTML/CSS rendering engine written in pure Java. Its primary purpose is to provide complete information about the rendered page in a form suitable for further processing. However, it can also display the rendered document.
http://cssbox.sourceforge.net/
GNU Lesser General Public License v3.0

DefaultDOMSource Document#getElementById() doesn't work #70

Open soundasleep opened 2 years ago

soundasleep commented 2 years ago

I found that if you try to load a Document via DefaultDOMSource, #getElementById() always returns null.

As far as I can tell, this is because CSSBox uses NekoHTML as its HTML parser, with Xerces as the underlying parser. Xerces only lets id="..." work for getElementById() when it runs as a validating parser, and NekoHTML is not set up as one. I think?
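
To illustrate the suspected cause outside CSSBox, here is a minimal sketch that uses only the JDK's built-in DOM parser (not NekoHTML, so it is an analogy rather than a reproduction of the CSSBox code path). The class and method names are made up for this example. It parses a small document without any DTD and checks whether the id attribute is typed as an ID and whether getElementById() finds the element; with the default non-validating setup, both checks are expected to come back false.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: shows that an untyped "id" attribute is invisible
// to getElementById() when no DTD declares it as an ID.
class IdLookupDemo {
    /** Parses the markup with the JDK's default, non-validating DOM parser. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    /** True if getElementById() finds an element for the given id. */
    static boolean foundById(String xml, String id) {
        return parse(xml).getElementById(id) != null;
    }

    /** True if the parser typed the root element's "id" attribute as an ID. */
    static boolean rootIdAttributeIsTyped(String xml) {
        Attr attr = parse(xml).getDocumentElement().getAttributeNode("id");
        return attr != null && attr.isId();
    }

    public static void main(String[] args) {
        String xml = "<div id=\"main\"><p>text</p></div>";
        System.out.println("found by id: " + foundById(xml, "main"));
        System.out.println("attribute typed as ID: " + rootIdAttributeIsTyped(xml));
    }
}
```

The attribute is present in the tree and readable through getAttribute("id"); it is only the ID typing that is missing, which is what the lookup depends on.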

However, I did find a fix on SourceForge that adds a custom filter to NekoHTML:

    @Override
    public Document parse() throws SAXException, IOException
    {
        // temporary NekoHTML fix until nekohtml gets fixed
        if (!neko_fixed)
        {
            HTMLElements.Element li = HTMLElements.getElement(HTMLElements.LI);
            HTMLElements.Element[] oldparents = li.parent;
            li.parent = new HTMLElements.Element[oldparents.length + 1];
            for (int i = 0; i < oldparents.length; i++)
                li.parent[i] = oldparents[i];
            li.parent[oldparents.length] = HTMLElements.getElement(HTMLElements.MENU);
            neko_fixed = true;
        }

        // start tweak
        HTMLConfiguration config = new HTMLConfiguration();
        XMLDocumentFilter idEnhancer = new DefaultFilter() {
            @Override
            public void startElement(QName element, XMLAttributes attributes, Augmentations augs) throws XNIException {
                int idx = attributes.getIndex("id");
                if (idx > -1) {
                    attributes.setType(idx, "ID");
                    Augmentations attrsAugs = attributes.getAugmentations(idx);
                    attrsAugs.putItem(Constants.ATTRIBUTE_DECLARED, Boolean.TRUE);
                }
                super.startElement(element, attributes, augs);
            }
        };
        XMLDocumentFilter[] filters = { idEnhancer };
        config.setProperty("http://cyberneko.org/html/properties/filters", filters);
        // end tweak

        DOMParser parser = new DOMParser(config);
        parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
        if (charset != null)
            parser.setProperty("http://cyberneko.org/html/properties/default-encoding", charset);
        parser.parse(new org.xml.sax.InputSource(getDocumentSource().getInputStream()));
        return parser.getDocument();
    }

I think this could be added to DefaultDOMSource, or HTMLConfiguration, but I'd imagine you'd want to add test cases as well, and I'm not sure what the implications of this might be.

soundasleep commented 2 years ago

Update: If you're trying to find IDs for elements that are naturally empty (such as <input>), it turns out that empty elements and normal elements go through separate filter callbacks (emptyElement() and startElement()), so the filter has to handle both. The XMLDocumentFilter should instead be:

    XMLDocumentFilter idEnhancer = new DefaultFilter() {
        /**
         * Makes #getElementById() work on any set of attributes
         */
        private void possiblyAddIdAttribute(XMLAttributes attributes) {
            int idx = attributes.getIndex("id");
            if (idx > -1) {
                attributes.setType(idx, "ID");
                Augmentations attrsAugs = attributes.getAugmentations(idx);
                attrsAugs.putItem(Constants.ATTRIBUTE_DECLARED, Boolean.TRUE);
            }
        }

        @Override
        public void startElement(QName element, XMLAttributes attributes, Augmentations augs) throws XNIException {
            possiblyAddIdAttribute(attributes);
            super.startElement(element, attributes, augs);
        }

        @Override
        public void emptyElement(QName element, XMLAttributes attributes, Augmentations augs) throws XNIException {
            possiblyAddIdAttribute(attributes);
            super.emptyElement(element, attributes, augs);
        }
    };
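
For anyone who cannot patch the parser setup, a different technique that might work is to mark the id attributes after parsing, using the standard DOM Level 3 Element.setIdAttribute() call. This is not the filter approach above, and I have not tried it against a Document produced by DefaultDOMSource; the sketch below uses the JDK's built-in parser as a stand-in, and the class and method names are hypothetical. Because it walks the finished tree, empty elements such as <input> need no special handling.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: post-parse alternative to the NekoHTML filter.
class IdMarker {
    /** Parses the markup with the JDK's default DOM parser (stand-in for CSSBox's parser). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    /** Recursively flags every "id" attribute in the subtree as an ID attribute. */
    static void markIdAttributes(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if (el.hasAttribute("id")) {
                el.setIdAttribute("id", true);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            markIdAttributes(child);
        }
    }

    /** Returns the tag name of the element with the given id, or null if none is found. */
    static String tagNameForId(String xml, String id) {
        Document doc = parse(xml);
        markIdAttributes(doc);
        Element el = doc.getElementById(id);
        return el == null ? null : el.getTagName();
    }

    public static void main(String[] args) {
        String xml = "<form id=\"search\"><input id=\"q\"/></form>";
        System.out.println(tagNameForId(xml, "q"));
        System.out.println(tagNameForId(xml, "search"));
    }
}
```

The trade-off is an extra pass over the whole tree, and IDs added to the document later would need to be marked again, whereas the filter types the attributes while the document is being built.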