Closed GoogleCodeExporter closed 9 years ago
A whitelisted, opt-in solution, great! I have a feeling this will lead to a
very big list, though. Can we deliver one safely, I wonder? Should the
developer be able to specify ranges or languages to allow? How can we make it
easier on a developer than:
<list-of-chars-to-not-encode>
<char>(special char 1)</char>
<char>(special char 2)</char>
...
<char>(special char n)</char>
</list-of-chars-to-not-encode>
... where n is huge? This is a strong candidate for inclusion in version 1.5.
Original comment by arshan.d...@gmail.com
on 23 Feb 2011 at 1:25
Original comment by arshan.d...@gmail.com
on 23 Feb 2011 at 1:26
Often (as in our case) we only plan to support one language, so it would
be a great start to offer just the simple whitelist suggested in your
comment, perhaps a bit more compactly, like
<chars>(special char 1),(special char2)</chars>.
This opens the possibility to later add a range feature like
<chars>(special char 1)-(special char n),(special char x)</chars>
But at least for us the simple list of umlauts (seven for German)
would be more than enough.
I'm not sure whether the commas are needed to distinguish between the
individual characters.
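To make the compact syntax concrete, here is a minimal Java sketch of how a comma-separated list with optional ranges might be parsed. The class name CharWhitelist and the exact syntax are hypothetical; nothing in this thread says AntiSamy parses anything like it. Each item is assumed to be either a single character or a first-last range, which also means a comma cannot itself be whitelisted in this form.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of AntiSamy.
final class CharWhitelist {

    // Parses a spec such as "a-c,x" into the set of characters it names.
    static Set<Character> parse(String spec) {
        Set<Character> allowed = new TreeSet<>();
        for (String item : spec.split(",")) {
            if (item.length() == 1) {
                // Single character, e.g. "x"
                allowed.add(item.charAt(0));
            } else if (item.length() == 3 && item.charAt(1) == '-') {
                // Inclusive range, e.g. "a-c"
                char from = item.charAt(0);
                char to = item.charAt(2);
                if (from > to) {
                    throw new IllegalArgumentException("Reversed range: " + item);
                }
                // int loop variable avoids overflow when 'to' is '\uffff'
                for (int c = from; c <= to; c++) {
                    allowed.add((char) c);
                }
            } else {
                throw new IllegalArgumentException("Bad item: " + item);
            }
        }
        return allowed;
    }
}
```

With this sketch, CharWhitelist.parse("a-c,x") gives the four characters a, b, c and x, and the seven German characters can be passed as seven single-character items. It only handles characters that fit in one Java char.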
Original comment by ewert%ne...@gtempaccount.com
on 23 Feb 2011 at 8:00
BTW the other way around would also be great: a negative list of characters to
be encoded. Currently chars like ";/\='" don't get encoded and some
people/tools (like XSS Me) say that it would be best if they were
encoded, just in case.
Original comment by ewert%ne...@gtempaccount.com
on 23 Feb 2011 at 11:11
Having poked around in the code at runtime, it seems that AntiSamy itself is
taking care of a perfectly reasonable set of HTML escaping (things like < and &
etc) using HTMLEntityEncoder but after that's done, the
org.apache.xml.serialize.XHTMLSerializer does further encoding on the end
result.
It also looks like that behaviour could be turned off with a call to:
XHTMLSerializer.startNonEscaping()
That call could be triggered from another configuration element (for example
Policy.ESCAPE_ALL or similar), which might be a great solution since it only
requires one new policy file directive.
Failing that, a whitelist is fine by me - just copy almost everything from the
HTML 5 Entity ref!
When is this likely to happen Arshan? I really need to get this sorted in
production code... :)
(Great work on this by the way - saved us so much bother!)
Original comment by RedYetiD...@gmail.com
on 23 Feb 2011 at 5:01
Hi Arshan,
Have you any idea when we might see this in a release please? :)
Thanks!
Original comment by RedYetiD...@gmail.com
on 11 May 2011 at 9:41
Issue 108 has been merged into this issue.
Original comment by arshan.d...@gmail.com
on 7 Jun 2011 at 5:20
Very glad to see this is still on the radar!
Can you let us know when we might be able to see a release in the Maven repo
containing this enhancement?
Thanks again!
Original comment by RedYetiD...@gmail.com
on 8 Jun 2011 at 9:06
That would be awesome. At the moment I am re-encoding the entities using
org.apache.commons.lang.StringEscapeUtils;
Original comment by husseini...@gmail.com
on 15 Jun 2011 at 4:46
Hi,
Can somebody suggest a workaround for this issue? My French string contains é,
and it got changed to &eacute;.
Isn't it doing UTF-8 encoding? Can we disable the encoding of the output string?
Also, is there any plan to support these kinds of multilingual characters soon?
thanks
Original comment by job...@gmail.com
on 11 Aug 2011 at 12:23
What I'm doing is:
Strip all HTML.
Then use
org.springframework.web.util.HtmlCharacterEntityReferences.htmlUnescape() to
put all the references back.
Then use my own StringHelper.htmlEscapeToSanitise() to sanitise a certain set of
dangerous HTML that shouldn't be in the fields (quotes, angle brackets etc.)
It's not great but it works!
Original comment by RedYetiD...@gmail.com
on 11 Aug 2011 at 12:50
And since it's easy but nevertheless tedious to write, here's
htmlEscapeToSanitise
(Classes used below are probably from org.apache.commons!)
// QUOTE_CHARS_TO_ENCODE and HTML_ENCODED_QUOTE_CHARS are referenced but not shown in this comment.
private static final String[] DANGEROUS_HTML_CHARS_TO_ENCODE =
    (String[]) ArrayUtils.addAll(QUOTE_CHARS_TO_ENCODE, new String[] {
        "&",
        "<",
        ">",
        "'"});
// Replacements, index-aligned with the array above.
private static final String[] HTML_ENCODED_DANGEROUS_HTML_CHARS =
    (String[]) ArrayUtils.addAll(HTML_ENCODED_QUOTE_CHARS, new String[] {
        "&amp;",
        "&lt;",
        "&gt;",
        "&#39;"}); // Note that we use the HTML escape for apostrophe (&apos; is not valid HTML - it's XML/XHTML/SGML)
public static String htmlEscapeToSanitise(String input)
{
    // replaceEach makes a single pass, so the "&" in an inserted entity is not escaped again.
    return StringUtils.replaceEach(input, DANGEROUS_HTML_CHARS_TO_ENCODE, HTML_ENCODED_DANGEROUS_HTML_CHARS);
}
Original comment by RedYetiD...@gmail.com
on 11 Aug 2011 at 12:54
Are you suggesting using this approach instead of AntiSamy? If not, is there any way
to integrate it with AntiSamy?
Original comment by job...@gmail.com
on 12 Aug 2011 at 7:24
No I'm certainly not suggesting to use this instead of AntiSamy. Home cooking
HTML validation is not sensible - it's far too complex an area.
In fact this approach uses AntiSamy: The step that says; "Strip all HTML."
should probably have been more explicit and actually read; "Strip all HTML
/with AntiSamy/".
This is a work-around, not a fix. In other words I still have this problem and
am waiting on a fix from the AntiSamy team.
Original comment by RedYetiD...@gmail.com
on 12 Aug 2011 at 9:09
There is another, slightly easier workaround: after cleaning up with AntiSamy,
re-encode the content using org.apache.commons.lang.StringEscapeUtils.
I'm frankly surprised that AntiSamy does not have such a feature. This makes it
unusable for any CMS whose user language is not English. They'll see gibberish
when they go to edit their content.
Original comment by husseini...@gmail.com
on 12 Aug 2011 at 12:47
I'd rather avoid turning this bug report into a thread on how to work around
this.. however:
I may be missing something here Husseini but using StringEscapeUtils is just
the same as using HtmlCharacterEntityReferences. They both result in unescaped
HTML references.
So, not easier, just the same surely?
The extra step I then add is to make it safer by re-escaping angles and quotes.
So:
1) Clean all HTML with AntiSamy -> escaped HTML
2) Use either HtmlCharacterEntityReferences or StringEscapeUtils -> unescaped
HTML
3) Optionally use the home-cooked sanitiser mentioned above
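Steps 2 and 3 above can be sketched roughly as follows. This is illustrative only: step 1 (the AntiSamy scan) is assumed to have already produced the input string, the class name UnescapeThenSanitise is made up, and the small entity table is a stand-in for HtmlCharacterEntityReferences or StringEscapeUtils, which know far more entities.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of steps 2 and 3; step 1 (AntiSamy) is assumed done.
final class UnescapeThenSanitise {

    // Tiny stand-in for a real entity unescaper.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&eacute;", "\u00e9");
        ENTITIES.put("&uuml;", "\u00fc");
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&quot;", "\"");
        // "&amp;" goes last so that "&amp;lt;" is decoded only once, to "&lt;".
        ENTITIES.put("&amp;", "&");
    }

    // Step 2: turn entity references back into plain characters.
    static String unescape(String s) {
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        return s;
    }

    // Step 3: re-escape only the characters that are dangerous in HTML.
    static String sanitise(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

Calling sanitise(unescape(cleaned)) leaves accented letters readable while angle brackets and quotes end up escaped again. Stopping after unescape() alone is what makes step 3 matter: any escaped markup in the cleaned text would come back as live markup.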
Original comment by RedYetiD...@gmail.com
on 12 Aug 2011 at 1:01
Actually DO NOT use the technique I suggested. It is completely unsafe.
AntiSamy is practically useless here: Check the following string:
&lt;script&gt;alert("hello world");&lt;/script&gt;
Using the technique I suggested is quite dangerous because the unescape step will
decode the escaped < and > back into live markup, and there you have it, XSS.
Original comment by husseini...@gmail.com
on 19 Aug 2011 at 5:05
I find the lack of support for UTF-8 characters surprising.
I don't think support for non-English language characters should be a special
requirement in this day and age.
Original comment by abitdo...@gmail.com
on 19 Aug 2011 at 5:21
Are you actually reading my replies to this thread?
Yes, the String: &lt;script&gt;alert("hello world");&lt;/script&gt; when run through
AntiSamy then subsequently through an unescape() call will remain dangerous.
Which is precisely why I posted the code above, that in the next reply I
mentioned as "step 3)" which solves the problem since it encodes particularly
dangerous characters.
But that's not any fault of AntiSamy's; if we decide to call unescape() using
some third party library on the result (and don't then re-encode the dangerous
characters ourselves) what can AntiSamy possibly do?
Anyhow, can we leave this here and wait for a proper fix?
Arshan? Any chance of this being fixed so we can stop talking about
workarounds? :)
Original comment by RedYetiD...@gmail.com
on 19 Aug 2011 at 5:34
And finally (I hope):
I'd very much prefer a blacklist approach - so I can specify just the
characters I consider dangerous (as above) and have all other characters left
alone. Without having to work out what those special characters are and pass
them in.
Original comment by RedYetiD...@gmail.com
on 19 Aug 2011 at 5:37
Checked in a solution to HEAD: a new directive, "entityEncodeIntlChars"
(default: false).
When true, "international" characters will be represented by their HTML
entities according to the HTML DTD. When false, they'll be echoed as-is, to
the worry of the person who set this setting to true.
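As a rough illustration of what such a switch does, here is a small Java sketch. It is not AntiSamy's implementation: the class name IntlCharEncoder is hypothetical, "international" is assumed to mean any code point above 127, and numeric character references are used in place of the named entities from the HTML DTD to keep the example short.

```java
// Hypothetical illustration of an on/off switch for entity-encoding
// non-ASCII characters; not taken from the AntiSamy source.
final class IntlCharEncoder {

    static String encode(String text, boolean entityEncodeIntlChars) {
        if (!entityEncodeIntlChars) {
            // Flag off: echo the text unchanged.
            return text;
        }
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp > 127) {
                // Assumption: numeric reference instead of a DTD entity name.
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

With the flag off the French string from earlier in the thread comes back untouched; with it on, the é is replaced by a character reference and the ASCII letters stay as they are.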
Original comment by arshan.d...@gmail.com
on 16 Sep 2011 at 6:22
What is the status of the fix mentioned above? It was posted on Sept 15th, but
the downloads area still has only the 1.4.4 version.
Original comment by ejjaq...@gmail.com
on 2 Feb 2012 at 1:41
So is this now included in the 1.5.1 version in the downloads area?
Original comment by RedYetiD...@gmail.com
on 26 Mar 2013 at 11:08
Original issue reported on code.google.com by
ewert%ne...@gtempaccount.com
on 21 Feb 2011 at 9:21