UTF-8 encoding glitch? - Githubissues

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Change org.h2.DataPage.writeString(String s)  to read like so:

public void writeString(String s) {
    int len = s.length();
    checkCapacity(len * 3 + 4);
    byte[] encoded = org.h2.util.StringUtils.utf8Encode(s);
    writeInt(len);
    write(encoded, 0, encoded.length);
}

2. Save file, run unit tests
3. Observe error in unit test... 

What is the expected output? What do you see instead?
Expected: no error, since it should be UTF-8 both ways
Instead: an error caused by differing lengths

Please use labels and text to provide additional information.
I suspect this is caused by UTF-8 allowing multiple encodings for single
characters, with different byte sizes, combined with custom handling of
UTF-8 data internally.  
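
For illustration, here is a minimal sketch of one way a length mismatch can arise. The modified method writes s.length() (a count of UTF-16 chars) as the prefix but then writes standard UTF-8 bytes, and those two numbers differ for non-ASCII text. The sketch also compares against the per-char "modified UTF-8" that the JDK's DataOutputStream.writeUTF produces. That is only a JDK stand-in: whether H2's own reader and writer behave the same way is an assumption, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of H2.
class Utf8LengthDemo {

    // Byte count of the standard UTF-8 encoding of s.
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Byte count of the JDK's modified UTF-8, which encodes one
    // UTF-16 char at a time (a surrogate pair becomes 3 + 3 bytes).
    static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size() - 2; // writeUTF prepends a 2-byte length
    }

    public static void main(String[] args) {
        // 'a', U+00E9, and U+1F600 (stored as a surrogate pair)
        String s = "a\u00e9\uD83D\uDE00";
        System.out.println("chars:          " + s.length());             // 4
        System.out.println("standard UTF-8: " + standardUtf8Length(s));  // 7
        System.out.println("modified UTF-8: " + modifiedUtf8Length(s));  // 9
    }
}
```

A reader that expects a char count followed by a per-char encoding would consume the wrong number of bytes from a buffer written this way, with no overlong or invalid encodings involved.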

May represent a security vulnerability, depending.

Severity:
Probably minor?

Original issue reported on code.google.com by buckyba...@gmail.com on 29 Jul 2009 at 4:39

GoogleCodeExporter commented 9 years ago

I don't understand. Why would you 
> Change org.h2.DataPage.writeString(String s)  to read like so?

Original comment by thomas.t...@gmail.com on 30 Jul 2009 at 2:46

GoogleCodeExporter commented 9 years ago

I was looking to see if it was faster, since I assumed they would be the same
(both encoding strings to UTF8 bytes).

This is not something a user will encounter.  It (like the bug with LOBs)
represents a reaction of "huh, that's funny, should that happen?" when examining
and working with the code.

My concern is that it may open the door for vulnerabilities with regard to UTF-8
and unusual encodings.  Two potential problems: multibyte encodings (longer than
the shortest legal encoding) and specifying invalid UTF-8 chars (things your
routines may accept, but should not).
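
As a rough sketch of what rejecting both kinds of input could look like, the snippet below uses the JDK's own UTF-8 decoder in strict mode (CodingErrorAction.REPORT). It says nothing about what H2's routines accept; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of H2.
class StrictUtf8 {

    // Returns true only if bytes is well-formed UTF-8 according to
    // the JDK decoder, which reports malformed input instead of
    // silently replacing it.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] slash = {0x2F};                                   // '/'
        byte[] overlongSlash = {(byte) 0xC0, (byte) 0xAF};       // 2-byte form of '/'
        byte[] loneSurrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0x80};
        System.out.println(isValidUtf8(slash));
        System.out.println(isValidUtf8(overlongSlash));
        System.out.println(isValidUtf8(loneSurrogate));
    }
}
```

The same byte arrays could be fed to a hand-written decoder to check whether it is as strict as this reference.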

UTF-8 problems are fairly widespread and common vulnerabilities. Early JREs have
issues with this that were only recently discovered:
http://sunsolve.sun.com/search/document.do?assetkey=1-66-245246-1 

If you google "UTF-8 vulnerability" you'll see a ton of other examples.  Rolling
your own UTF-8 handling (as here) may be the way to go, it just needs to be
checked for problems from the two sets of routines conflicting.

I think both problems can be checked somewhat in unit tests by generating random
UTF-8 characters and random bytes within specific ranges, and then seeing how
they are handled. There may already be tests for this, I'd just like to confirm
that there are no issues with different UTF-8 handlings.
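
A minimal sketch of such a round-trip check, assuming a fixed seed for reproducibility. It uses only the JDK's UTF-8 codec as a stand-in: a real test would encode with one set of routines and decode with the other. All names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;

// Hypothetical test helper; not part of H2.
class RandomRoundTrip {

    // Builds a string of random Unicode code points, skipping the
    // surrogate range, which has no valid UTF-8 encoding on its own.
    static String randomString(Random rnd, int codePoints) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < codePoints; i++) {
            int cp;
            do {
                cp = rnd.nextInt(Character.MAX_CODE_POINT + 1);
            } while (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE);
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    // Encodes and decodes with the JDK codec; swap in the routines
    // under test on either side to compare implementations.
    static boolean roundTrips(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        return new String(encoded, StandardCharsets.UTF_8).equals(s);
    }

    // Runs the given number of random trials from a fixed seed.
    static boolean allRoundTrip(long seed, int trials) {
        Random rnd = new Random(seed);
        for (int i = 0; i < trials; i++) {
            if (!roundTrips(randomString(rnd, 1 + rnd.nextInt(32)))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allRoundTrip(42L, 1000));
    }
}
```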

If you give me commit access, I'll add some tests.

Original comment by buckyba...@gmail.com on 30 Jul 2009 at 3:54

GoogleCodeExporter commented 9 years ago

H2 uses its own storage format. It doesn't matter if it's UTF-8 or not,
because the format is private to H2. I don't see how the data format 
of H2 could be a vulnerability.

Please only open bugs for actual issues.

Original comment by thomas.t...@gmail.com on 30 Jul 2009 at 4:07

Changed state: Invalid

lbehnke / h2database

UTF-8 encoding glitch? #104