jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License
10.94k stars 2.19k forks source link

Difference in Jsoup clean behaviour between 1.15.4 and 1.16.1-SNAPSHOT with respect to <br /> tag #1916

Closed Reeniya closed 1 year ago

Reeniya commented 1 year ago

Hi,

We had raised a issue few days back https://github.com/jhy/jsoup/issues/1911 related newline character being missed. So that was converted to a bug and was fixed which would be released as part of 1.16.1.

So we wanted to test if this fix would resolve the issue we were facing. So we consumed the 1.16.1-SNAPSHOT version to test few of our scenarios. During our testing we found that Jsoup clean is behaving a bit differently with respect to <br /> tag We are using html parser.

here is the difference:

Jsoup clean() code looks something like this:

String InputText = "<span style="padding-top: 0px; padding-bottom: 0px; margin-top: 0px; margin-bottom: 0px; border-spacing: 2px 2px;">testing<br />testing</span>"

String cleanWithHtmlTags = Jsoup.clean(InputText, "", Safelist.none().addTags("br", "p", "tr", "div"), new org.jsoup.nodes.Document.OutputSettings().prettyPrint(true));

String cleanText = Jsoup.clean(cleanWithHtmlTags, "", Safelist.none(), new org.jsoup.nodes.Document.OutputSettings().prettyPrint(false));

with 1.15.4 this is the output we are seeing

cleanWithHtmlTags: testing<br>\ntesting
cleanText: testing\ntesting

with 1.16.1-SNAPSHOT

cleanWithHtmlTags: testing\n<br>\ntesting
cleanText: testing\n\ntesting

With 1.16.1-SNAPSHOT we are seeing an additional \n getting adding before <br> tag which is adding two new lines in the final text

testing

testing

with 1.15.4 version this is the output

testing
testing

@jhy is this an expected behaviour? or will it be a new issue that will be introduced as part of 1.16.1 release?

brdeepak commented 1 year ago

@jhy Could you please share any updates on this issue mentioned?

jhy commented 1 year ago

I believe this is OK and makes sense - the newline emitted as part of pretty-printing the original is then preserved plus a newline for the <br> when you feed the first HTML into the second with pretty-printing disabled.

BTW, I find this line concerning:

String cleanText = Jsoup.clean(cleanWithHtmlTags, "", Safelist.none(), new org.jsoup.nodes.Document.OutputSettings().prettyPrint(false));

The output of Jsoup.clean is HTML, not text. Safeline.none() will only pass textnodes, but the output is still HTML. (I may be reading too much into your variable name -- if you consider it 'cleanTextNodeOnlyHtml' then that's fine.)

If you want plain text with some formatting, this is how I would approach the problem:

String inputText = """
    <span style="padding-top: 0px; padding-bottom: 0px; margin-top: 0px; margin-bottom: 0px; border-spacing: 2px 2px;">testing<br />testing</span>
    """;
Cleaner cleaner = new Cleaner(Safelist.none().addTags("br", "p", "tr", "div"));
Document clean = cleaner.clean(Jsoup.parse(inputText));

System.out.println(clean.body().html());
System.out.println("---");
System.out.println(clean.wholeText());

Gives:

testing
<br>
testing
---
testing
testing
terrettaz commented 1 year ago

Since version 1.16.1 we have now this issue:

Jsoup.clean("<title>titre</title> <br/><b>hello</b> <a>link</a>", Safelist.none().addTags("br"))

This used to corretly return this: titre <br> hello link But now: titre\n<br>\nhello link

@jhy Do you have any suggestion to handle this ?

jhy commented 1 year ago

That will render in browser as the same visual output, so it's just a matter of preference on the layout of the source.

You can disable the pretty-printer if you don't want to use it. There's no finer control of what it emits currently, though.