iabudiab / HTMLKit

An Objective-C framework for your everyday HTML needs.
MIT License

Implement HTML escaping for arbitrary string input #31

Open guidedways opened 6 years ago

guidedways commented 6 years ago

This looks like a powerful library for navigating HTML nodes. However, what would be the simplest way to obtain cleaned-up 'plain text' from HTML input? I'd like it to preserve any 'invalid' non-HTML tags such as John Do <john@do.com> and not try to parse them as NSAttributedString's initWithHTML does.

guidedways commented 6 years ago

Okay, the following seems to fail:

let element: HTMLElement = HTMLElement(tagName: "div")
element.innerHTML = "This is an <b>email</b>: John Do <john@do.com>"
print("\(element.textContent)")

outputs: This is an email: John Do

What do I have to do to make this work so that it ignores anything that doesn't look like HTML?

iabudiab commented 6 years ago

@guidedways Hey there. Let me see if I understood you correctly.

You want to input an HTML string and have all HTML tags stripped, as in This is an <b>email</b>: John Do <john@do.com> should return This is an email: John Do <john@do.com>?

If so, then the easiest way to do it is to escape all HTML reserved characters so that they are not interpreted as HTML. In your case:

let element: HTMLElement = HTMLElement(tagName: "div")
element.innerHTML = "This is an <b>email</b>: John Do &lt;john@do.com&gt;"
print("\(element.textContent)")
// This is an email: John Do <john@do.com>
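
The escaping step above can be sketched as a small helper. This is only an illustration: `escapeHTML` is a hypothetical function, not part of HTMLKit, and it assumes the string passed in is known to be plain text, because it escapes real tags such as <b> as well.

```swift
/// Escapes the characters that HTML treats as markup.
/// Hypothetical helper for illustration; not an HTMLKit API.
func escapeHTML(_ text: String) -> String {
    var result = ""
    for character in text {
        switch character {
        case "&": result += "&amp;"   // escape "&" so existing text is not read as an entity
        case "<": result += "&lt;"
        case ">": result += "&gt;"
        default: result.append(character)
        }
    }
    return result
}

// Only the plain-text portion is escaped; the markup is concatenated unescaped.
let html = "This is an <b>email</b>: " + escapeHTML("John Do <john@do.com>")
```

The resulting string can then be assigned to innerHTML as in the snippet above.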

Some Details

innerHTML in HTMLKit behaves like it would in a browser, i.e. it sets the HTML content of an element to the string that is passed. The string is interpreted as an HTML fragment and parsed with the element as its context.

What does that mean? Well, your input gets parsed to this DOM:

<div>This is an <b>email</b>: John Do <john@do.com></john@do.com></div>

Take a look here for more info: MDN Element.innerHTML

Does this answer your question? Do you have any follow-up questions?

guidedways commented 6 years ago

Yes, that is the output I'm after, but I am not in control of the string being received from the user. It could be anything <some strange non-html tag>. I need the library to be able to do this for me, i.e. to escape < as &lt;. Can HTMLKit find and escape non-HTML 'tags' for me?

guidedways commented 6 years ago

I should explain. I'm receiving input directly from the user as notes. The notes could be actual HTML or could be partial / invalid HTML. There's no way to tell, since they're free to type in whatever they wish. What I need is to parse the HTML and extract the plain-text version of whatever they entered, while retaining any such odd entries, links, etc. that weren't entered as HTML.
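
One way to approximate this outside the library is a pre-processing pass that escapes every < that does not start a recognised tag. The sketch below is a heuristic, not HTMLKit functionality: `knownTags`, `startsKnownTag` and `escapeUnknownTags` are hypothetical names, and the allowlist is an assumed, deliberately short sample. A bare > is left untouched because an HTML parser treats it as ordinary text.

```swift
// Hypothetical allowlist of element names to treat as real HTML (assumed, incomplete).
let knownTags: Set<String> = ["a", "b", "br", "div", "em", "i", "li", "ol", "p", "span", "strong", "u", "ul"]

/// Returns true when `rest` (the text right after a "<") begins with a known tag name
/// that is followed by whitespace, "/" or ">".
func startsKnownTag(_ rest: Substring) -> Bool {
    var remainder = rest
    if remainder.first == "/" { remainder = remainder.dropFirst() }   // closing tag
    let name = remainder.prefix(while: { $0.isASCII && ($0.isLetter || $0.isNumber) })
    guard knownTags.contains(name.lowercased()) else { return false }
    guard let next = remainder.dropFirst(name.count).first else { return false }
    return next == ">" || next == "/" || next.isWhitespace
}

/// Escapes "<" as "&lt;" unless it opens or closes a tag from `knownTags`.
func escapeUnknownTags(_ input: String) -> String {
    var output = ""
    var index = input.startIndex
    while index < input.endIndex {
        let character = input[index]
        let next = input.index(after: index)
        if character == "<" && !startsKnownTag(input[next...]) {
            output += "&lt;"
        } else {
            output.append(character)
        }
        index = next
    }
    return output
}

let cleaned = escapeUnknownTags("This is an <b>email</b>: John Do <john@do.com>")
```

The result could then be assigned to innerHTML and read back through textContent. The approach will misjudge inputs where free text happens to begin with an allowlisted name followed by a space or >, so it is a starting point rather than a complete solution.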

iabudiab commented 6 years ago

@guidedways I see. Currently HTMLKit does not provide this functionality. I'll see if I can implement this in the next couple of days. Will let you know as soon as I have something.

I'll rename the issue then and mark it as a feature request.

guidedways commented 6 years ago

Thank you, that would be extremely helpful!