mchibouni / owasp-esapi-php

Automatically exported from code.google.com/p/owasp-esapi-php

Codec::decode cannot accept a UTF-32 encoded empty string as decodedCharacter #27

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Problem:
Returning an empty string as the decodedCharacter to Codec::decode from,
for instance, CSSCodec::decodeCharacter, will cause Codec::decode to not
strip the encodedString portion from the original input string.

It is necessary to return an empty string as the decodedCharacter for
situations where the encodedString is to be effectively "ignored" (i.e.
simply stripped from original string, replaced with nothing).

The workaround at the moment manifests within CSSCodec::decodeCharacter
where (seemingly) a non-UTF-32 encoded space is returned to Codec::decode.
If a UTF-32 encoded space is returned as decodedCharacter then the
encodedString portion is removed from original string but replaced by a
space (as it should!).

Would be nice to see UTF-32 encoded empty string handled properly so as to
reinforce the contract to normalize all strings within Codecs to UTF-32. 

Files that need to be addressed for this issue:
Codec.php
CSSCodec.php
(possibly other specific Codec implementations in future)

Original issue reported on code.google.com by coreform on 2 Dec 2009 at 4:26

GoogleCodeExporter commented 9 years ago
I think I've fixed this in r508.

An empty string has no characters, therefore it can have no character encoding.
It follows that anywhere we do:
$encodedOutput = mb_convert_encoding("", SOME_CHARACTER_ENCODING);
we could simply do:
$encodedOutput = '';
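
A quick way to check this claim, assuming the mbstring extension is loaded. The UTF-8 source encoding and the variable names are picked only for illustration and are not taken from the ESAPI code:

```php
<?php
// Sketch only: needs ext/mbstring. 'UTF-8' as the source encoding is an assumption.
$viaMbstring = mb_convert_encoding('', 'UTF-32', 'UTF-8');
$literal = '';

// Both should be zero-length strings, so there are no bytes that could carry an encoding.
var_dump($viaMbstring === $literal);
var_dump(strlen($viaMbstring));
```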

The reason that '' wasn't being handled as desired (i.e. ignoring the encoded portion
of a string or, more accurately, forcing the removal of the encoded portion from the
normalized input string and adding nothing to the decoded string) was that in
Codec::decode an empty string was being treated the same way as null:
'' == null is true.
The fix is simply to check:
if ($decodedCharacter !== null)
and then we can return an empty string whenever we want to strip a character from the
string.  Again, it's not necessary to apply character encoding to an empty string.
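
To illustrate the difference between the two checks, here is a minimal, byte-oriented sketch. It is not the code from r508: decodeSketch, decodeCharacterSketch, the $strict flag and the rule that a backslash followed by a newline is stripped are all hypothetical, and the real codecs work on UTF-32 normalized strings rather than raw bytes. The loose branch is only an assumption about what the old check might have looked like.

```php
<?php
// Hypothetical stand-in for a codec's decodeCharacter(): returns the decoded
// character and the encoded portion it consumed, or nulls if nothing decodes.
function decodeCharacterSketch(string $input): array
{
    // Assumed rule for illustration: backslash + newline decodes to "nothing".
    if (strncmp($input, "\\\n", 2) === 0) {
        return ['decodedCharacter' => '', 'encodedString' => "\\\n"];
    }
    return ['decodedCharacter' => null, 'encodedString' => null];
}

// Hypothetical decode loop. $strict = true uses the !== null check from the fix;
// $strict = false uses a loose comparison, which treats '' like null.
function decodeSketch(string $input, bool $strict = true): string
{
    $output = '';
    while ($input !== '') {
        $result  = decodeCharacterSketch($input);
        $decoded = $result['decodedCharacter'];
        $matched = $strict ? ($decoded !== null) : ($decoded != null);

        if ($matched) {
            // Append the decoded character (possibly '') and drop the encoded portion.
            $output .= $decoded;
            $input = (string) substr($input, strlen($result['encodedString']));
        } else {
            // Nothing decoded: copy one byte through unchanged.
            $output .= $input[0];
            $input = (string) substr($input, 1);
        }
    }
    return $output;
}

var_dump('' == null);  // bool(true)
var_dump('' === null); // bool(false)
var_dump(decodeSketch("ab\\\ncd"));        // strict check: encoded portion removed
var_dump(decodeSketch("ab\\\ncd", false)); // loose check: encoded portion left in place
```

With the strict check the backslash-newline pair disappears from the output; with the loose check the empty decodedCharacter is mistaken for "nothing decoded" and the pair is copied through byte by byte.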

As a side note: the only reason that returning a single space was a good workaround
for this issue was that it was returned as a single-byte character and the
conversion from UTF-32 yielded an empty string.

Original comment by jahboite@gmail.com on 16 Feb 2010 at 4:12