jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License

Should wholeText() introduce newlines between block elements? #2083

Open h920526 opened 9 months ago

h920526 commented 9 months ago

Hi team,

Jsoup v1.16.1

<div><p>Hello</p><p>World</p></div>

After calling wholeText():

Expected: Hello World

Actual: HelloWorld

The text of the two paragraphs is not broken onto a new line. Thanks.

jhy commented 9 months ago

This is "as designed" currently - wholeText gets only the non-normalized text values from the elements.

I have considered changing it to emit a newline when encountering a new block tag as that seems more useful.

text() will give you normalized text with a space (not a newline) between the nodes. That's designed for e.g. indexing / searching / extracting.

Would be good to hear opinions from folks on this. It seems safe and information preserving.
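
To make the difference between the two calls concrete, here is a minimal sketch using the markup from the original report. The class and helper names (TextVsWholeText, normalized, raw) are made up for illustration; only Jsoup.parse, body(), text(), and wholeText() are jsoup calls. The wholeText() result is the one reported above for v1.16.1 and may differ in other versions.

```java
import org.jsoup.Jsoup;

final class TextVsWholeText {
    // Normalized text: whitespace is collapsed and block elements are
    // separated by a single space.
    static String normalized(String html) {
        return Jsoup.parse(html).body().text();
    }

    // Non-normalized text: the text node values are concatenated as-is.
    static String raw(String html) {
        return Jsoup.parse(html).body().wholeText();
    }

    public static void main(String[] args) {
        String html = "<div><p>Hello</p><p>World</p></div>";
        System.out.println(normalized(html)); // Hello World
        System.out.println(raw(html));        // reported as HelloWorld on v1.16.1
    }
}
```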

akashsahu25 commented 8 months ago

Use a br tag.

andyrozman commented 5 days ago

I tried to use wholeText() as a way to convert HTML to text, but it doesn't really work: \n characters are not ignored (they should be), and the resulting text has some odd indentation.

text() is even worse.

Is there any other command that could be used to convert html content into text that produces better results?

andyrozman commented 5 days ago

@h920526 For your case, I think you need to wrap your markup in html and body tags. I needed to do that, so something like this:

<html><body><div><p>Hello</p><p>World</p></div></body></html>

andyrozman commented 5 days ago

@jhy It might be useful to have a command that converts HTML to text. At the moment wholeText does this, but there are problems; see my first message.
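
Until something like that exists, one possible workaround is to walk the node tree and insert the newlines yourself. The following is only a sketch of the idea discussed in this thread (emit a newline when a new block tag starts), not jsoup's own behavior. BlockAwareText and blockAwareText are hypothetical names; NodeTraversor, NodeVisitor, TextNode.getWholeText(), Element.isBlock(), and Element.normalName() are existing jsoup APIs. It does not collapse whitespace inside text nodes, so source indentation would still show up in the result.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

final class BlockAwareText {
    // Hypothetical helper: concatenates text nodes like wholeText(), but
    // starts a new line at each <br> and at the start of each block element.
    static String blockAwareText(Element root) {
        StringBuilder sb = new StringBuilder();
        NodeTraversor.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    sb.append(((TextNode) node).getWholeText());
                } else if (node instanceof Element) {
                    Element el = (Element) node;
                    if (el.normalName().equals("br")) {
                        sb.append('\n');
                    } else if (el.isBlock() && sb.length() > 0
                            && sb.charAt(sb.length() - 1) != '\n') {
                        // Avoid leading and doubled newlines for nested blocks.
                        sb.append('\n');
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        }, root);
        return sb.toString();
    }

    public static void main(String[] args) {
        Element div = Jsoup.parse("<div><p>Hello</p><p>World</p></div>").selectFirst("div");
        System.out.println(blockAwareText(div)); // prints Hello and World on separate lines
    }
}
```

Checking the last appended character keeps nested blocks such as the div and its first p from producing a leading or doubled newline.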