iabudiab / HTMLKit

An Objective-C framework for your everyday HTML needs.
MIT License

textContent strips <br/>s #32

Open guidedways opened 6 years ago

guidedways commented 6 years ago
let element: HTMLElement = HTMLElement(tagName: "div")
element.innerHTML = "Line<br/>Breaks"
print("\(element.textContent)")

output:

LineBreaks

desired:

Line\nBreaks

At least this is how NSAttributedString's initWithHTML works. Is there anything I need to do to get this to work properly?

guidedways commented 6 years ago

Actually, a better question would be: how do I get HTMLKit to behave just like NSAttributedString? The only reason I'm looking for an alternative is that NSAttributedString uses WebKit internally and keeps the runloop running on the main thread, causing other asynchronous issues. It looks like HTMLKit is returning me a string with all tags stripped, whereas I'd like it to return the equivalent of what I'd get if I simply turned HTML into plain text.

iabudiab commented 6 years ago

@guidedways textContent is behaving as it should, i.e. <br> tags are stripped because they are not textual content. Take a look at the MDN documentation for Node.textContent.
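
To make that rule concrete, here is a minimal sketch of the textContent definition: concatenate the data of every text node below the element, in tree order, and ignore everything else. The `Node` enum is a hypothetical stand-in for a DOM tree, used only for illustration; it is not HTMLKit's own API.

```swift
// Hypothetical minimal DOM model, for illustration only (not HTMLKit's types).
enum Node {
    case text(String)
    case element(tag: String, children: [Node])
}

// Concatenates the data of all text descendants in tree order.
// Elements themselves contribute nothing, so <br> simply disappears.
func textContent(of node: Node) -> String {
    switch node {
    case .text(let data):
        return data
    case .element(_, let children):
        return children.map { textContent(of: $0) }.joined()
    }
}

// Tree for: <div>Line<br>Breaks</div>
let div = Node.element(tag: "div", children: [
    .text("Line"),
    .element(tag: "br", children: []),
    .text("Breaks"),
])

print(textContent(of: div)) // prints "LineBreaks"
```

The `<br>` element has no text descendants, which is why it leaves no trace in the result.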

NSAttributedString

how do I get HTMLKit to behave just like NSAttributedString?

For HTMLKit to behave like NSAttributedString, it would have to render the HTML and then return the resulting visual representation as a string. That's why NSAttributedString uses WebKit internally.

Plain Text

I'd like it to return me an equivalent to what I'd get if I simply turned HTML to plain text.

This is a much more complex topic than you might initially realise. The same goes for your other issue, #31.

Strictly speaking, the plain text variant of Line<br/>Breaks would be LineBreaks, because <br> is an HTML tag, i.e. the input is parsed to this DOM, assuming it is parsed as a fragment inside a <div>:

<div>Line<br>Breaks</div>

However HTML parsing is very lenient and even the most corrupt/invalid/unknown HTML would still produce a DOM tree that is more or less usable. Hence an input like this:

This is an <b>email</b>: John Do <john@do.com>

would produce this:

<div>This is an <b>email</b>: John Do <john@do.com></john@do.com></div>

Notice how the email <john@do.com> is now an element in the DOM.

Now let's take a look at another example, say the input is:

<table><tr><td>Hello<td>Plain<tr>Text

What would the plain text of this be? Is it HelloPlainText, or HelloPlain\nText, or Hello\tPlain\nText, or something completely different?

What I am trying to say is:

If you could provide a universally valid definition to turn HTML to plain text then maybe I could implement it.

The HTML standard specifies one such definition, and it is implemented via the textContent property. The bad news is that it is not usable for many purposes without further processing.

...

All this to say, I don't have a solution for this issue, and I'm still not completely sure how to solve #31 in a general way.

I'll let you know when I come to a conclusion.

Jcragons commented 3 years ago

@iabudiab sorry, this is an old thread, but I'm new to the class :) I agree with your point about the strategy or algorithm for turning HTML elements into plain text. With a basically styled HTML string, though, the main issue here is that <br> turns into nothing. Could there be an option for <br> = space or <br> = newline? I've seen this in PHP classes where you can "map" what each HTML node returns, like <b> = **{textContent}** (Markdown style) or <b> = textContent for plain text only.

iabudiab commented 3 years ago

@Jcragons 👋 hey there. Let me see if I understood correctly. You want an option to be able to specify how some tags should be replaced when retrieving the textContent of a node, correct? i.e. something like

let element: HTMLElement = HTMLElement(tagName: "div")
element.innerHTML = "Hello<br/>World"
let text = element.textContent(withCustomRules: ["br": " "])
// text: Hello World

I guess this shouldn't be hard to implement. However, I won't promise anything about an ETA 😉
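
For illustration, here is a rough sketch of how such rule-based extraction could work, covering both the plain replacement (<br> = newline) and the Markdown-style mapping (<b> = **{textContent}**) mentioned above. Everything in it is an assumption: the `Node` enum, the `Rule` type, and `plainText(of:rules:)` are hypothetical names on a stand-in DOM model, not HTMLKit API and not a committed design.

```swift
// Hypothetical minimal DOM model, for illustration only (not HTMLKit's types).
enum Node {
    case text(String)
    case element(tag: String, children: [Node])
}

// A rule receives the already-extracted text of an element's children
// and returns what the element should contribute to the output.
typealias Rule = (String) -> String

// Like textContent, except that elements with a matching rule
// have their inner text passed through that rule.
func plainText(of node: Node, rules: [String: Rule]) -> String {
    switch node {
    case .text(let data):
        return data
    case .element(let tag, let children):
        let inner = children.map { plainText(of: $0, rules: rules) }.joined()
        if let rule = rules[tag] {
            return rule(inner)
        }
        return inner
    }
}

// Tree for: <div>Hello<br><b>World</b></div>
let div = Node.element(tag: "div", children: [
    .text("Hello"),
    .element(tag: "br", children: []),
    .element(tag: "b", children: [.text("World")]),
])

let rules: [String: Rule] = [
    "br": { _ in "\n" },              // <br> becomes a newline
    "b": { inner in "**\(inner)**" }, // <b> is wrapped Markdown-style
]

print(plainText(of: div, rules: rules)) // prints "Hello", then "**World**" on the next line
```

With an empty rules dictionary the function falls back to plain textContent behaviour, so the rules stay strictly opt-in.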

Jcragons commented 3 years ago

@iabudiab yeah, exactly that :) No pressure on the ETA, I know :) Anyway, I think it could be a nice addition; a lot of people use an old PHP class just because it has this feature. I'm sure it could help a lot here :)