Whitelist for special characters being untouched

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. scan a text with umlauts

What is the expected output? What do you see instead?
I want to get back the umlauts, but they are encoded as HTML-entities.

What version of the product are you using? On what operating system?
We are using antisamy 1.4.

Please provide any additional information below.
We are currently using AntiSamy for sanitizing free text supplied by users. 
The text isn't always meant to contain HTML, so the HTML entities are not 
displayed correctly. 

It would be great if we could configure a list of umlauts that should be left 
untouched, because they pose no security risk.

Original issue reported on code.google.com by ewert%ne...@gtempaccount.com on 21 Feb 2011 at 9:21

GoogleCodeExporter commented 9 years ago

A whitelisted, opt-in solution, great! I have a feeling this will lead to a 
very big list, though. Can we deliver one safely, I wonder? Should the 
developer be able to specify ranges or languages to allow? How can we make it 
easier on a developer than:

<list-of-chars-to-not-encode>
  <char>(special char 1)</char>
  <char>(special char 2)</char>
  ...
  <char>(special char n)</char>
</list-of-chars-to-not-encode>

... where n is huge? This is a strong candidate for inclusion in version 1.5.

Original comment by arshan.d...@gmail.com on 23 Feb 2011 at 1:25

Changed state: Accepted
Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 9 years ago

Original comment by arshan.d...@gmail.com on 23 Feb 2011 at 1:26

GoogleCodeExporter commented 9 years ago

Often (as in our case) only one language needs to be supported. So it would 
be a great start to offer just the simple whitelist as suggested in your 
comment. Perhaps a bit more compactly, like 

<chars>(special char 1),(special char2)</chars>. 

This opens the possibility to later add a range feature like 

<chars>(special char 1)-(special char n),(special char x)</chars>

But at least for us the simple list of umlauts (seven for German) 
would be more than enough.

I'm not sure whether the commas are needed to separate the individual 
characters.
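
To make the proposed syntax concrete, here is a minimal sketch of how such a value could be parsed and applied. Everything in it is hypothetical: the `CharWhitelist` class, the comma/range syntax, and the choice to emit numeric entities for non-whitelisted characters are assumptions for illustration, not AntiSamy code. It only deals with non-ASCII characters; escaping of markup characters such as < and & would still have to happen separately.

```java
import java.util.BitSet;

/** Hypothetical parser for the proposed <chars> value; not part of AntiSamy. */
final class CharWhitelist {
    private final BitSet allowed = new BitSet();

    /** Parses comma-separated items, each a single character or an inclusive range "a-b". */
    static CharWhitelist parse(String spec) {
        CharWhitelist list = new CharWhitelist();
        for (String item : spec.split(",")) {
            int[] cps = item.codePoints().toArray();
            if (cps.length == 1) {
                list.allowed.set(cps[0]);
            } else if (cps.length == 3 && cps[1] == '-' && cps[0] <= cps[2]) {
                list.allowed.set(cps[0], cps[2] + 1); // upper bound of BitSet.set is exclusive
            } else {
                throw new IllegalArgumentException("Bad whitelist item: " + item);
            }
        }
        return list;
    }

    boolean contains(int codePoint) {
        return allowed.get(codePoint);
    }

    /** Leaves ASCII and whitelisted characters alone; other characters become numeric entities. */
    String encode(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128 || contains(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The seven German characters, written as Unicode escapes to avoid source-encoding issues.
        CharWhitelist german = parse("\u00e4,\u00f6,\u00fc,\u00c4,\u00d6,\u00dc,\u00df");
        System.out.println(german.encode("Gr\u00fc\u00dfe, caf\u00e9"));
    }
}
```

With this syntax the commas are what make a range like "a-b" distinguishable from three separate characters, at the cost that a comma or hyphen itself cannot be whitelisted without some escaping rule.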

Original comment by ewert%ne...@gtempaccount.com on 23 Feb 2011 at 8:00

GoogleCodeExporter commented 9 years ago

BTW the other way around would also be great: a negative list of characters to 
be encoded. Currently chars like ";/\='" don't get encoded, and some 
people/tools (like XSS Me) say that it would be best if they were 
encoded, just in case.

Original comment by ewert%ne...@gtempaccount.com on 23 Feb 2011 at 11:11

GoogleCodeExporter commented 9 years ago

Having poked around in the code at runtime, it seems that AntiSamy itself is 
taking care of a perfectly reasonable set of HTML escaping (things like < and & 
etc) using HTMLEntityEncoder but after that's done, the 
org.apache.xml.serialize.XHTMLSerializer does further encoding on the end 
result.

It also looks like that behaviour could be turned off with a call to:

XHTMLSerializer.startNonEscaping()

Which could be triggered from another configuration element (for example 
Policy.ESCAPE_ALL or similar).

Which might be a great solution since it only requires one new policy file 
directive.

Failing that, a whitelist is fine by me - just copy almost everything from the 
HTML 5 Entity ref!

When is this likely to happen Arshan? I really need to get this sorted in 
production code... :)

(Great work on this by the way - saved us so much bother!)

Original comment by RedYetiD...@gmail.com on 23 Feb 2011 at 5:01

GoogleCodeExporter commented 9 years ago

Hi Arshan,

Have you any idea when we might see this in a release please? :)

Thanks!

Original comment by RedYetiD...@gmail.com on 11 May 2011 at 9:41

GoogleCodeExporter commented 9 years ago

Issue 108 has been merged into this issue.

Original comment by arshan.d...@gmail.com on 7 Jun 2011 at 5:20

GoogleCodeExporter commented 9 years ago

Very glad to see this is still on the radar!

Can you let us know when we might be able to see a release in the Maven repo 
containing this enhancement?

Thanks again!

Original comment by RedYetiD...@gmail.com on 8 Jun 2011 at 9:06

GoogleCodeExporter commented 9 years ago

That would be awesome. At the moment I am re-encoding the entities using 
org.apache.commons.lang.StringEscapeUtils;

Original comment by husseini...@gmail.com on 15 Jun 2011 at 4:46

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

Hi,
Can somebody suggest a workaround for this issue? My French string contains é, 
and it got changed to &e. 
Isn't it doing UTF-8 encoding? Can we disable the encoding of the output string?
Also, is there any plan to support these kinds of multilingual characters soon?

thanks

Original comment by job...@gmail.com on 11 Aug 2011 at 12:23

GoogleCodeExporter commented 9 years ago

What I'm doing is:

Strip all HTML.
Then use 
org.springframework.web.util.HtmlCharacterEntityReferences.htmlUnescape() to 
put all the references back.
Then use my own StringHelper.htmlEscapeToSanitise() to sanitise a certain set of 
dangerous HTML characters that shouldn't be in the fields (quotes, angle brackets etc.)

It's not great but it works!

Original comment by RedYetiD...@gmail.com on 11 Aug 2011 at 12:50

GoogleCodeExporter commented 9 years ago

And since it's easy but nevertheless tedious to write, here's 
htmlEscapeToSanitise

(Classes used below are probably from org.apache.commons!)

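import org.apache.commons.lang.ArrayUtils;
import org.apache.commons.lang.StringUtils;

// QUOTE_CHARS_TO_ENCODE and HTML_ENCODED_QUOTE_CHARS are defined elsewhere (not shown here).

// Characters to search for.
private static final String[] DANGEROUS_HTML_CHARS_TO_ENCODE =
    (String[]) ArrayUtils.addAll(QUOTE_CHARS_TO_ENCODE, new String[] {"&", "<", ">", "'"});

// Replacements, index-aligned with the array above.
// Note that we use the HTML escape for apostrophe (&apos; is not valid HTML - it's XML/XHTML/SGML)
private static final String[] HTML_ENCODED_DANGEROUS_HTML_CHARS =
    (String[]) ArrayUtils.addAll(HTML_ENCODED_QUOTE_CHARS, new String[] {"&amp;", "&lt;", "&gt;", "&#39;"});

public static String htmlEscapeToSanitise(String input)
{
    // replaceEach makes a single pass, so the "&" of an inserted entity is not encoded again.
    return StringUtils.replaceEach(input, DANGEROUS_HTML_CHARS_TO_ENCODE, HTML_ENCODED_DANGEROUS_HTML_CHARS);
}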

Original comment by RedYetiD...@gmail.com on 11 Aug 2011 at 12:54

GoogleCodeExporter commented 9 years ago

Are you suggesting not to use AntiSamy and to use this approach instead? If not, 
is there any way to integrate it with AntiSamy?

Original comment by job...@gmail.com on 12 Aug 2011 at 7:24

GoogleCodeExporter commented 9 years ago

No I'm certainly not suggesting to use this instead of AntiSamy. Home cooking 
HTML validation is not sensible - it's far too complex an area.

In fact this approach uses AntiSamy: The step that says; "Strip all HTML." 
should probably have been more explicit and actually read; "Strip all HTML 
/with AntiSamy/".

This is a work-around, not a fix. In other words I still have this problem and 
am waiting on a fix from the AntiSamy team.

Original comment by RedYetiD...@gmail.com on 12 Aug 2011 at 9:09

GoogleCodeExporter commented 9 years ago

There is another, slightly easier workaround: after cleaning up with AntiSamy, 
re-encode the content using org.apache.commons.lang.StringEscapeUtils

I'm frankly surprised that AntiSamy does not have such a feature. This makes it 
unusable for any CMS whose user language is not English. They'll see gibberish 
when they go to edit their content.

Original comment by husseini...@gmail.com on 12 Aug 2011 at 12:47

GoogleCodeExporter commented 9 years ago

I'd rather avoid turning this bug report into a thread on how to work around 
this.. however:

I may be missing something here Husseini but using StringEscapeUtils is just 
the same as using HtmlCharacterEntityReferences. They both result in unescaped 
HTML references.

So, not easier, just the same surely?

The extra step I then add is to make it safer by re-escaping angles and quotes.

So:

1) Clean all HTML with AntiSamy -> escaped HTML
2) Use either HtmlCharacterEntityReferences or StringEscapeUtils -> unescaped 
HTML
3) Optionally use the home-cooked sanitiser mentioned above
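
A minimal, library-free sketch of steps 2 and 3, to show why the order matters. The class name `WorkaroundSketch`, the tiny entity table, and the sample string are all made up for illustration; in real code step 1 would be AntiSamy and step 2 one of the library calls named above. The input here simply stands in for what step 1 might return.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative stand-in for steps 2 and 3; not AntiSamy, Spring or Commons Lang code. */
final class WorkaroundSketch {
    // A handful of entities only; a real unescape call knows the full HTML entity set.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&auml;", "\u00e4");
        ENTITIES.put("&ouml;", "\u00f6");
        ENTITIES.put("&uuml;", "\u00fc");
        ENTITIES.put("&szlig;", "\u00df");
        ENTITIES.put("&eacute;", "\u00e9");
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&quot;", "\"");
        ENTITIES.put("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;" and not "<"
    }

    /** Step 2: turn entity references back into characters (including the dangerous ones). */
    static String unescape(String s) {
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        return s;
    }

    /** Step 3: re-encode only the characters that matter for markup; "&" must go first. */
    static String escapeDangerous(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&#39;");
    }

    public static void main(String[] args) {
        // Hypothetical output of step 1: text with encoded umlauts and an encoded script tag.
        String cleaned = "Gr&uuml;&szlig;e &lt;script&gt;alert(1)&lt;/script&gt;";
        String unescaped = unescape(cleaned);      // umlauts are back, but so are < and >
        String safe = escapeDangerous(unescaped);  // umlauts stay, < and > are encoded again
        System.out.println(unescaped);
        System.out.println(safe);
    }
}
```

After step 2 alone the string contains a live script tag again, which is why step 3 cannot really be treated as optional when the result is written into an HTML page.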

Original comment by RedYetiD...@gmail.com on 12 Aug 2011 at 1:01

GoogleCodeExporter commented 9 years ago

Actually DO NOT use the technique I suggested. It is completely unsafe. 
AntiSamy is practically useless here. Check the following string:

&lt;script&gt;alert("hello world");&lt;/script&gt;

Using the technique I suggested is quite dangerous because the unescape step 
decodes the &gt; and &lt; back into > and <, and there you have it, XSS.

Original comment by husseini...@gmail.com on 19 Aug 2011 at 5:05

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

I find the lack of support for UTF-8 characters surprising.

I don't think support for non-English language characters should be a special 
requirement in this day and age.

Original comment by abitdo...@gmail.com on 19 Aug 2011 at 5:21

GoogleCodeExporter commented 9 years ago

Are you actually reading my replies to this thread?

Yes, the string &lt;script&gt;alert("hello world");&lt;/script&gt;, when run through 
AntiSamy and subsequently through an unescape() call, will remain dangerous.

Which is precisely why I posted the code above, which in the next reply I 
referred to as "step 3)" and which solves the problem, since it encodes 
particularly dangerous characters.

But that's not any fault of AntiSamy's; if we decide to call unescape() using 
some third-party library on the result (and don't then re-encode the dangerous 
characters ourselves), what can AntiSamy possibly do?

Anyhow, can we leave this here and wait for a proper fix?

Arshan? Any chance of this being fixed so we can stop talking about 
workarounds? :)

Original comment by RedYetiD...@gmail.com on 19 Aug 2011 at 5:34

GoogleCodeExporter commented 9 years ago

And finally (I hope):

I'd very much prefer a blacklist approach - so I can specify just the 
characters I consider dangerous (as above) and have all other characters left 
alone. Without having to work out what those special characters are and pass 
them in.

Original comment by RedYetiD...@gmail.com on 19 Aug 2011 at 5:37

GoogleCodeExporter commented 9 years ago

Checked in a solution to HEAD: a new directive, "entityEncodeIntlChars" 
(default: false).

When true, "international" characters will be represented by their HTML 
entities as according to the HTML DTD. When false, they'll be echoed as-is, to 
the worry of the person who set this setting to true.
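
For illustration, and assuming the new directive is set like the existing ones in the `<directives>` section of the policy file (the value shown is just an example), the setting might look like this:

```xml
<directives>
    <!-- assumed syntax: same name/value form as other AntiSamy directives -->
    <directive name="entityEncodeIntlChars" value="false"/>
</directives>
```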

Original comment by arshan.d...@gmail.com on 16 Sep 2011 at 6:22

Changed state: Verified

GoogleCodeExporter commented 9 years ago

What is the status of the fix mentioned above? It was posted on Sept 15th, but 
the downloads area still only has version 1.4.4.

Original comment by ejjaq...@gmail.com on 2 Feb 2012 at 1:41

GoogleCodeExporter commented 9 years ago

So is this now included in the 1.5.1 version in the downloads area?

Original comment by RedYetiD...@gmail.com on 26 Mar 2013 at 11:08

rjatkins / owaspantisamy

Whitelist for special characters being untouched #101