jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License

Self-closing tags are replaced by opening and closing tags by Element append(String html) #1893

Closed Reeniya closed 1 year ago

Reeniya commented 1 year ago

I am upgrading jsoup from 1.11.3 to 1.15.3 in my project, and I see a difference in the way self-closing tags are handled by public Element append(String html) (https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/nodes/Element.java#L732)

With jsoup 1.11.3, when the input contains a <br> tag and I call append(String html), it replaces the <br> with <br />

With jsoup 1.15.3, when the input contains a <br> tag and I call append(String html), it replaces the <br> with <br></br>

I see that append(String html) uses HtmlTreeBuilder, so I think it uses the HTML parser to parse the HTML string provided as input.

I want append(String html) to return the <br> tag as <br /> after I upgrade jsoup to 1.15.3. How can this be achieved? Is this an issue with the latest version?

Could someone suggest how I can handle this case in my code? I need to use append(String html) to append a subsection within a section.

For example, if my subsection is <p> this is <br> text</p> and I call append(String html) to append it to the <div> tag, it should return <div><p> this is <br /> text</p></div> (Note: this is the behavior in jsoup 1.11.3, but not in 1.15.3.)

Thank you in advance.

jhy commented 1 year ago

Can you provide actual test code that shows the behavior (code, what you get, vs what you want). I'm not clear on what parser (HTML or XML), what you're appending, what serializer you're using, etc.

Reeniya commented 1 year ago

Hi @jhy, I tried to write a simple test to reproduce this scenario. I was not able to reproduce it exactly, but I found a different behavior when I call append(String html).

Input string = <p><br /> this is a sample text</p>

HTML string output from jsoup xml parser: <html><head></head><body><p><br /> this is a sample text</p></body></html>

HTML string after append is called: <div> <p><br> this is a sample text</p> </div>

Test code:

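    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Entities;
    import org.jsoup.parser.Parser;
    import org.jsoup.parser.Tag;

    public void TestAppend(){
        String inputHTML = "<p><br /> this is a sample text</p>";

        // Parse with the XML parser, then move the parsed nodes into an HTML shell document
        Document xmlDoc = Jsoup.parse(inputHTML, Parser.xmlParser());
        Document container = Document.createShell("");
        int nodesCount = xmlDoc.childNodeSize();
        for (int i = 0; i < nodesCount; i++) {
            // appendChild moves the node, so index 0 is always the next remaining child
            container.body().appendChild(xmlDoc.childNode(0));
        }
        xmlDoc = container;
        xmlDoc.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
        xmlDoc.outputSettings().escapeMode(Entities.EscapeMode.xhtml);
        System.out.println("HTML string output from jsoup xml parser:\n" + xmlDoc.toString());

        // A standalone Element that is not attached to any document
        Element sectionElement = new Element(Tag.valueOf("div"), "");
        sectionElement.empty();
        sectionElement.append(xmlDoc.outerHtml());
        System.out.println("HTML string after append is called:\n" + sectionElement.toString());
    }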

In the output from the XML parser, we can see that <br /> is retained as it is. But as soon as we call sectionElement.append(xmlDoc.outerHtml()); the <br /> gets converted to <br>

If we call append(String html) multiple times to create subsections, it converts <br> to <br></br>. I am not able to reproduce this case with a simple test.

How can I ensure that calling sectionElement.append(xmlDoc.outerHtml()); does not convert <br /> to <br>?

Is there a reason why append(String html) converts <br /> to <br> when it encounters <br />?

Thanks

jhy commented 1 year ago

> I see that when I call append(String html) it uses HtmlTreeBuilder, so I think append(String html) uses the HTML parser to parse the HTML string provided as input.

No, it uses the same parser as the parser that produced the Element you are appending to. See: https://github.com/jhy/jsoup/blob/da23af85f39df9f0df732029ed26c34764811009/src/main/java/org/jsoup/nodes/Element.java#L732-L737

In your example code, you are creating a new Element outside of a configured parser, and so that defaults to using the HTML parser:

    Element sectionElement = new Element(Tag.valueOf("div"), "");

And then you are re-parsing the XML doc as HTML (in the context of the HTML Element sectionElement). Hence, the self-closing br tag is emitted as <br>.
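To make the difference concrete, here is a minimal sketch (not code from the issue) that appends the same fragment twice: once to a detached element, and once to an element that lives inside an XML-parsed document. The class and method names are hypothetical, and the exact whitespace of the output may vary between jsoup versions, so only the form of the br tag is worth comparing.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.parser.Tag;

// Hypothetical demo class; the names here are illustrative, not part of jsoup.
public class AppendParserDemo {

    // Detached element: it has no owner document, so append() defaults to the HTML parser.
    static String appendToDetachedElement(String fragment) {
        Element div = new Element(Tag.valueOf("div"), "");
        div.append(fragment);
        return div.outerHtml();
    }

    // Element inside an XML-parsed document: append() reuses that document's XML parser.
    static String appendInsideXmlDocument(String fragment) {
        Document doc = Jsoup.parse("<section></section>", Parser.xmlParser());
        Element div = doc.selectFirst("section").appendElement("div");
        div.append(fragment);
        return doc.outerHtml();
    }

    public static void main(String[] args) {
        String fragment = "<p>this is <br /> text</p>";
        System.out.println("Detached element:\n" + appendToDetachedElement(fragment));
        System.out.println("Inside XML document:\n" + appendInsideXmlDocument(fragment));
    }
}
```

In the first case the br should be serialised as <br>, and in the second as <br />, because the element that receives the fragment determines which parser and output settings apply.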

Also, you might like to use the wrap(html) method to wrap a div or other content around existing content. It works more efficiently and simply than serialising and re-parsing as you're doing now.

Here's an example:

    String xml = "<p><br />Text</p>";
    Document doc = Jsoup.parse(xml, Parser.xmlParser());
    Element p = doc.selectFirst("p");
    Element div = p.wrap("<div>");
    p.append("<br />");
    System.out.println("XML:\n" + doc.outerHtml());

Produces:

    XML:
    <div><p><br />Text<br /></p></div>
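
Separately from which parser append() uses, the serialised form of a void tag such as br also depends on the document's output syntax, which the test code earlier in the thread sets through outputSettings().syntax(...). The following is a small sketch of that setting on an HTML-parsed document; the class and method names are hypothetical, and it is an illustration under those assumptions rather than a recommended fix.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class; the names here are illustrative, not part of jsoup.
public class OutputSyntaxDemo {

    // Parse with the default HTML parser, then serialise the body with the requested syntax.
    static String serialize(String html, Document.OutputSettings.Syntax syntax) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings().syntax(syntax);
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<p>this is <br> text</p>";
        // html syntax writes the void tag as <br>; xml syntax writes it as <br />
        System.out.println(serialize(html, Document.OutputSettings.Syntax.html));
        System.out.println(serialize(html, Document.OutputSettings.Syntax.xml));
    }
}
```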

Reeniya commented 1 year ago

Thanks @jhy, this resolved the issue I was facing.