scinfu / SwiftSoup

SwiftSoup: Pure Swift HTML Parser, with best of DOM, CSS, and jquery (Supports Linux, iOS, Mac, tvOS, watchOS)
https://scinfu.github.io/SwiftSoup/
MIT License
4.52k stars 345 forks

Parse Line Breaks in .text() #156

Closed winsmith closed 4 years ago

winsmith commented 4 years ago

Is there a way to get line breaks out of parsed text? Suppose I have an element like so:

<p class="mycooltext">
  Lorem ipsum dolor sit amet, consectetur adipiscing elit.<br><br>
  Vestibulum feugiat ex eu turpis efficitur bibendum.
</p>

If I use the text function on this element, I get

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum feugiat ex eu turpis efficitur bibendum.

But I'd rather have

Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n\n Vestibulum feugiat ex eu turpis efficitur bibendum.

with the <br> tags converted to newlines. Is that possible somehow?

amiantos commented 4 years ago

I, too, am curious about this.

For now I've worked around it by doing .replacingOccurrences(of: "<br />", with: "BREAK").replacingOccurrences(of: "</p>", with: "PARAGRAPH") on the output of .html() and then running .replacingOccurrences(of: "BREAK", with: "\n").replacingOccurrences(of: "PARAGRAPH", with: "\n\n") on the output of .text(). It's a kludge, but it works!
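
A minimal sketch of this marker approach, assuming the raw HTML is available as a string. The helper name `textKeepingBreaks` is hypothetical, and unlike the description above it keeps the closing `</p>` tag and also matches the bare `<br>` spelling. The marker strings must not occur in the real text.

```swift
import Foundation
import SwiftSoup

// Hypothetical helper: marks line breaks before text extraction,
// then swaps the markers for real newlines afterwards.
func textKeepingBreaks(_ html: String) throws -> String {
    let breakMarker = "BREAK"          // arbitrary marker, assumed absent from the content
    let paragraphMarker = "PARAGRAPH"  // arbitrary marker, assumed absent from the content

    // 1. Put markers where the line breaks should end up.
    let marked = html
        .replacingOccurrences(of: "<br />", with: breakMarker)
        .replacingOccurrences(of: "<br>", with: breakMarker)
        .replacingOccurrences(of: "</p>", with: paragraphMarker + "</p>")

    // 2. Let SwiftSoup strip the tags; the markers survive as ordinary text.
    let text = try SwiftSoup.parse(marked).text()

    // 3. Turn the markers into newlines.
    return text
        .replacingOccurrences(of: breakMarker, with: "\n")
        .replacingOccurrences(of: paragraphMarker, with: "\n\n")
}
```

Since `text()` normalizes whitespace, the result may still need trimming around the inserted newlines.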

winsmith commented 4 years ago

My workaround is this: instead of text(), I use the html() of the element. Then I parse it into an NSAttributedString, which is actually what I want. But you could read the attributed string's string property to get a plain string.

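// Requires SwiftSoup plus UIKit or AppKit (for the HTML importer).
// The function name is only illustrative; it returns nil if the HTML cannot be converted.
func storyText(from doc: Document) throws -> String? {
    // Get HTML contents and convert them to Data
    let contents = try doc.select(".b-story-body-x div p").html()
    let data = Data(contents.utf8)

    // Convert to NSAttributedString; state the encoding so non-ASCII characters survive
    guard let attributedString = try? NSAttributedString(data: data, options: [.documentType: NSAttributedString.DocumentType.html, .characterEncoding: String.Encoding.utf8.rawValue], documentAttributes: nil) else { return nil }

    // If you need a plain string, use the attributed string's `.string` property
    return attributedString.string
}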

If you only need a string with line breaks, this is probably a bit wasteful, but I actually want the paragraph value with line breaks, ems, etc, so this is perfect for me.

scinfu commented 4 years ago

Try this, and uncomment the p tag line if you need it:

let doc: Document = try! SwiftSoup.parse(MYHTML)
//set pretty print to false, so \n is not removed
doc.outputSettings(OutputSettings().prettyPrint(pretty: false))

//select all <br> tags and append \n after that
try doc.select("br").after("\\n")

//select all <p> tags and prepend \n before that
//try doc.select("p").before("\\n") // uncomment if needed

//get the HTML from the document, and retaining original new lines
let str = try doc.html().replacingOccurrences(of: "\\\\n", with: "\n")

let strWithNewLines = try SwiftSoup.clean(str, "", Whitelist.none(), OutputSettings().prettyPrint(pretty: false))

winsmith commented 4 years ago

This is super helpful, thank you very much!

scinfu commented 4 years ago

Closed due to inactivity, if necessary feel free to reopen.

ptrkstr commented 3 years ago

Thank you for the workaround, @scinfu! Not sure if things have changed since then, but I noticed the output included \\n instead of just \n (unless that was the intention). The double slash caused the new line to be escaped. What worked for me was the following:

let doc: Document = try SwiftSoup.parse("A<br>A")
//set pretty print to false, so \n is not removed
doc.outputSettings(OutputSettings().prettyPrint(pretty: false))

//select all <br> tags and append \n after that
try doc.select("br").after("\n")

//select all <p> tags and prepend \n before that
//try doc.select("p").before("\n") // uncomment if needed

//get the HTML from the document, and retaining original new lines
let str = try doc.html()

let strWithNewLines = try SwiftSoup.clean(str, "", Whitelist.none(), OutputSettings().prettyPrint(pretty: false))

// Result: strWithNewLines == "A\nA"
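
To tie this back to the original question, the same steps could be wrapped in a small helper and applied to the `mycooltext` paragraph. This is only a sketch: the function name `textPreservingLineBreaks` is hypothetical, and the return type is declared optional on the assumption that `SwiftSoup.clean` may return nil.

```swift
import SwiftSoup

// Hypothetical helper wrapping the steps from the comment above.
func textPreservingLineBreaks(_ html: String) throws -> String? {
    let doc: Document = try SwiftSoup.parse(html)

    // Disable pretty printing so that inserted newlines are not reformatted away.
    let settings = OutputSettings().prettyPrint(pretty: false)
    doc.outputSettings(settings)

    // Append a newline after every <br> tag.
    try doc.select("br").after("\n")

    // Serialize the document, then strip every tag while keeping the text.
    let rawHTML = try doc.html()
    return try SwiftSoup.clean(rawHTML, "", Whitelist.none(), settings)
}

// Usage with the markup from the original question.
let question = """
<p class="mycooltext">
  Lorem ipsum dolor sit amet, consectetur adipiscing elit.<br><br>
  Vestibulum feugiat ex eu turpis efficitur bibendum.
</p>
"""
let text = try textPreservingLineBreaks(question)
print(text ?? "")
```

Because pretty printing is off, the newlines and indentation already present in the source markup are retained as well, so the result may need trimming or whitespace cleanup before display.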