Closed GoogleCodeExporter closed 9 years ago
It is probably unnecessary to escape these characters in HTML text nodes,
though it is necessary in attribute bodies.
How is this causing problems though?
Original comment by mikesamuel@gmail.com
on 24 Jun 2013 at 5:03
It's causing problems because user-entered text is being returned to them in a
format different than they entered. I've put in a hack to unescape the
sanitized text which will fix this, but I'd like to not have to do that as this
might have unforeseen consequences.
Original comment by jcmathe...@gmail.com
on 24 Jun 2013 at 6:14
jcmather21, without more context, I really can't help. Can you give an example
of user-entered text and explain why a different, but semantically equivalent
form is problematic?
If your hack involves replacing `&#34;` with `"` and you white-list the "b" element
and an attribute like "title" then you might have problems with inputs like
<b title='foo " onmouseover="alert(1337)'>Foo
being sanitized to
<b title="foo &#34; onmouseover=&#34;alert(1337)">Foo
and then you might hack that back to
<b title="foo " onmouseover="alert(1337)">Foo
which executes script when the user's mouse passes over the text.
Original comment by mikesamuel@gmail.com
on 24 Jun 2013 at 6:43
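The attribute-injection risk described in the comment above can be reproduced with plain string handling. This is a minimal sketch, not code from the thread: the class name UnescapeHazard and the helper naiveUnescape are hypothetical, and the helper stands in for whatever unescaping hack is applied to the sanitizer's output.

```java
public class UnescapeHazard {
    // Hypothetical naive "fix": turn escaped double quotes back into literal quotes.
    static String naiveUnescape(String sanitizedHtml) {
        return sanitizedHtml.replace("&#34;", "\"").replace("&quot;", "\"");
    }

    public static void main(String[] args) {
        // Sanitized form: the quote inside the title value is escaped,
        // so the attribute value cannot end early.
        String sanitized = "<b title=\"foo &#34; onmouseover=&#34;alert(1337)\">Foo</b>";

        // After unescaping, the literal quote closes the title attribute and
        // onmouseover is parsed as a separate event-handler attribute.
        String hacked = naiveUnescape(sanitized);
        System.out.println(hacked);
    }
}
```

Unescaping is only safe if the result is never interpreted as HTML again.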
Yes sorry. I'm thinking outside the scope of html in the input string. For
example, a sentence like this:
He said, "This is the best place ever!"
I want that to be returned exactly like that, but it comes back as:
He said, &#34;This is the best place ever!&#34;
For my use case, I am taking user-entered description text and I want all html,
script, css removed (we only support plaintext now). But I want regular text,
left as is. So in your above example:
<b title='foo " onmouseover="alert(1337)'>Foo
I want back
Foo
Original comment by jcmathe...@gmail.com
on 24 Jun 2013 at 7:03
The HTML sanitizer takes messy-unsafe HTML and gives back well-formed-safe HTML.
It sounds like you want to get back plain text -- the innerText or textContent
-- without any tags at all.
If so, that's doable using the HTML sanitizer, but not using a method that is
advertised as returning HTML.
If that's what you need, let me know and I can knock up some example code.
Original comment by mikesamuel@gmail.com
on 24 Jun 2013 at 7:09
Yes that does sound like what I need so example code would be awesome on how to
accomplish this!
Original comment by jcmathe...@gmail.com
on 24 Jun 2013 at 7:14
If you have a policy builder called myPolicyBuilder, and a string of HTML
called myHtml then
    StringBuilder sb = new StringBuilder();
    HtmlSanitizer.policy = myPolicyBuilder.build(new HtmlStreamEventReceiver() {
        public void openDocument() {}
        public void closeDocument() {}
        public void openTag(String elementName, List<String> attribs) {
            if ("br".equals(elementName)) { sb.append('\n'); }
        }
        public void closeTag(String elementName) {}
        public void text(String text) { sb.append(text); }
    });
    HtmlSanitizer.sanitize(myHtml, policy);
    // sb should now contain the plain text content of the page with <br> replaced by newlines.
Original comment by mikesamuel@gmail.com
on 24 Jun 2013 at 7:43
Sorry. There were typos. Try
    final StringBuilder sb = new StringBuilder();
    HtmlSanitizer.Policy policy = myPolicyBuilder.build(new HtmlStreamEventReceiver() {
        public void openDocument() {}
        public void closeDocument() {}
        public void openTag(String elementName, List<String> attribs) {
            if ("br".equals(elementName)) { sb.append('\n'); }
        }
        public void closeTag(String elementName) {}
        public void text(String text) { sb.append(text); }
    });
    HtmlSanitizer.sanitize(myHtml, policy);
    // sb should now contain the plain text content of the page with <br> replaced by newlines.
Original comment by mikesamuel@gmail.com
on 24 Jun 2013 at 7:44
I haven't yet gotten that to work for all of my use cases but thank you.
Original comment by jcmathe...@gmail.com
on 25 Jun 2013 at 3:48
A solution is to use StringEscapeUtils.unescapeHtml() from Apache Commons Lang
before storing in the database.
Original comment by rajkumar...@gmail.com
on 5 Jul 2014 at 7:34
StringEscapeUtils.unescapeHtml() is not a solution. A clever hacker can then
use <script> to inject XSS.
Original comment by cantara...@gmail.com
on 10 Sep 2014 at 9:58
Cantata, please see comments 5&6. The string is to be used as plain text, not
HTML.
Original comment by mikesamuel@gmail.com
on 11 Sep 2014 at 10:29
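The distinction drawn above, storing the value as plain text rather than HTML, implies that escaping belongs at the point where the text is later written into a page. The following is a rough sketch under that assumption, using only the JDK. The class PlainTextOutput and the method escapeForHtmlText are hypothetical names and are not part of the sanitizer's API.

```java
public class PlainTextOutput {
    // Escape plain text so it can be placed in an HTML text node or a quoted attribute value.
    static String escapeForHtmlText(String plain) {
        StringBuilder out = new StringBuilder(plain.length() + 16);
        for (int i = 0; i < plain.length(); i++) {
            char c = plain.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&#34;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stored as plain text, exactly as the user typed it.
        String stored = "He said, \"This is the best place ever!\" <script>alert(1)</script>";

        // Escaped only when it is emitted into an HTML page.
        System.out.println("<p>" + escapeForHtmlText(stored) + "</p>");
    }
}
```

With this split, the database holds the user's original characters, and markup-significant characters are neutralized each time the text is rendered.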
Original issue reported on code.google.com by
jcmathe...@gmail.com
on 24 Jun 2013 at 4:55