jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License

Allow clean to keep text nodes with style #1993

Closed silk-bahamut closed 1 year ago

silk-bahamut commented 1 year ago

I would like to clean some HTML that contains styling: remove some tags, but keep the styling of the text nodes. However, it seems that if the node is a TextNode, the tag and all of its styling are lost.

  @Test
  void keepStyle() {
    Assertions.assertThat(Jsoup.clean("""
                <p>
                  <a href="http://google.fr>should be removed</a>
                  <div>not allowed<span>allowed be inside</span></div>
                  <span style="background-color: #ba372a;">should be kept with style</span>
                </p>
                """, new Safelist()
        .addTags("p", "b", "em", "i", "strong", "u", "span", "ul", "ol", "li", "pre", "h1", "h2", "h3", "h4", "h5", "h6")
        .addAttributes(":all", "style"))
        )
        .isEqualTo("""
            <span style="background-color: #ba372a;">should be kept with style</span>
            """);
  }
jhy commented 1 year ago

This isn't an issue with the Cleaner. Your input HTML is missing the closing " on the <a href> attribute, so most of the content that follows is parsed as the attribute value. If you fix that, the clean works:

String html = """
    <p>
    <a href="http://google.fr">should be removed</a>
    <div>not allowed<span>allowed be inside</span></div>
    <span style="background-color: #ba372a;">should be kept with style</span>
    </p>
""";
Safelist allowStyle = new Safelist()
    .addTags("p", "b", "em", "i", "strong", "u", "span", "ul", "ol", "li", "pre", "h1", "h2", "h3", "h4", "h5", "h6")
    .addAttributes(":all", "style");

String clean = Jsoup.clean(html, allowStyle);

System.out.println(clean);

Gives:

<p>should be removed</p>not allowed<span>allowed be inside</span> <span style="background-color: #ba372a;">should be kept with style</span>
<p></p>
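
To illustrate why the missing quote swallows the following markup, here is a minimal sketch of how a double-quoted attribute value is read: everything after the opening quote, up to the next ", becomes the value. This is a simplified model and not jsoup's actual tokenizer; the class QuotedValueDemo and the helper readQuotedValue are hypothetical names used only for illustration, and character-reference decoding is left out.

```java
public class QuotedValueDemo {

    // Simplified model (an assumption, not jsoup's implementation) of reading a
    // double-quoted attribute value: the value is everything between the opening
    // quote and the next '"' in the input, regardless of any '<' or '>' in between.
    static String readQuotedValue(String html, String attrPrefix) {
        int open = html.indexOf(attrPrefix);
        if (open < 0) {
            return null; // attribute not present
        }
        int start = open + attrPrefix.length();
        int close = html.indexOf('"', start);
        // No closing quote anywhere: the value runs to the end of the input.
        return close < 0 ? html.substring(start) : html.substring(start, close);
    }

    public static void main(String[] args) {
        // The input from the original report, with the closing quote missing on href.
        String broken = "<p>\n"
            + "  <a href=\"http://google.fr>should be removed</a>\n"
            + "  <div>not allowed<span>allowed be inside</span></div>\n"
            + "  <span style=\"background-color: #ba372a;\">should be kept with style</span>\n"
            + "</p>\n";

        // The same input with the quote restored.
        String fixed = broken.replace("http://google.fr>", "http://google.fr\">");

        System.out.println("broken href: " + readQuotedValue(broken, "href=\""));
        System.out.println("fixed href:  " + readQuotedValue(fixed, "href=\""));
    }
}
```

For the broken input, the value only ends at the quote that was meant to open the style attribute, so the div and the start of the styled span end up inside href rather than in the document tree. With the quote restored, the value is just the URL.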