Closed Reeniya closed 1 year ago
@jhy Could you please share any updates on this issue?
I believe this is OK and makes sense: the newline emitted as part of pretty-printing the original is preserved, plus a newline for the <br>, when you feed the first HTML into the second with pretty-printing disabled.
BTW, I find this line concerning:
String cleanText = Jsoup.clean(cleanWithHtmlTags, "", Safelist.none(), new org.jsoup.nodes.Document.OutputSettings().prettyPrint(false));
The output of Jsoup.clean is HTML, not text. Safelist.none() will only pass text nodes, but the output is still HTML. (I may be reading too much into your variable name; if you think of it as 'cleanTextNodeOnlyHtml' then that's fine.)
If you want plain text with some formatting, this is how I would approach the problem:
String inputText = """
<span style="padding-top: 0px; padding-bottom: 0px; margin-top: 0px; margin-bottom: 0px; border-spacing: 2px 2px;">testing<br />testing</span>
""";
Cleaner cleaner = new Cleaner(Safelist.none().addTags("br", "p", "tr", "div"));
Document clean = cleaner.clean(Jsoup.parse(inputText));
System.out.println(clean.body().html());
System.out.println("---");
System.out.println(clean.wholeText());
Gives:
testing
<br>
testing
---
testing
testing
Since version 1.16.1, we now have this issue:
Jsoup.clean("<title>titre</title> <br/><b>hello</b> <a>link</a>", Safelist.none().addTags("br"))
This used to correctly return this: titre <br> hello link
But now: titre\n<br>\nhello link
@jhy Do you have any suggestions on how to handle this?
That will render in a browser as the same visual output, so it's just a matter of preference on the layout of the source.
You can disable the pretty-printer if you don't want to use it. There's no finer control of what it emits currently, though.
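For anyone whose tests compare the cleaned output as an exact string, one possible workaround is to compare in a layout-insensitive way, since the two outputs quoted above differ only in whitespace. The following is a minimal sketch in plain Java, not part of jsoup: the class name and the `collapseWhitespace` helper are hypothetical, and it assumes the HTML contains no whitespace-sensitive content such as <pre> blocks.

```java
import java.util.regex.Pattern;

public class LayoutInsensitiveCompare {
    // Matches any run of spaces, tabs, or newlines.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Hypothetical helper (not a jsoup API): collapses each whitespace run
    // to a single space and trims the ends, so that source layout is ignored.
    public static String collapseWhitespace(String html) {
        return WHITESPACE_RUN.matcher(html).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // The two outputs quoted in this thread, with \n as a real newline.
        String before = "titre <br> hello link";
        String after = "titre\n<br>\nhello link";

        // Both normalize to the same string, so a comparison done this way
        // does not depend on where the pretty-printer puts its line breaks.
        System.out.println(collapseWhitespace(before).equals(collapseWhitespace(after)));
    }
}
```

This only helps with comparisons. If the exact bytes of the output matter, disabling the pretty-printer as described above is still the way to go.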
Hi,
We raised an issue a few days ago, https://github.com/jhy/jsoup/issues/1911, about a newline character being dropped. It was classified as a bug and fixed, and the fix is due to be released as part of 1.16.1.
We wanted to check whether this fix resolves the issue we were facing, so we tested a few of our scenarios against the 1.16.1-SNAPSHOT version. During our testing we found that Jsoup.clean behaves slightly differently with respect to the <br /> tag. We are using the HTML parser. Here is the difference:
Jsoup clean() code looks something like this:
with 1.15.4 this is the output we are seeing
with 1.16.1-SNAPSHOT
With 1.16.1-SNAPSHOT we are seeing an additional \n added before the <br> tag, which results in two newlines in the final text. With version 1.15.4, this is the output:
@jhy Is this expected behaviour, or is it a new issue that will be introduced as part of the 1.16.1 release?