gssapi / gss-ntlmssp

A complete implementation of the MS-NLMP documents as a GSSAPI mechanism
ISC License

Crypto routine failure with UTF-16 surrogate pair in password #20

Closed jborean93 closed 4 years ago

jborean93 commented 4 years ago

When trying to get a credential for a password that contains a UTF-16 surrogate pair char like 𝄞 (U+1D11E) it fails with

Major (851968): Unspecified GSS failure...Minor (1314127875): Crypto routine failure

This seems to be a problem with the libunistring library when trying to convert UTF-8 bytes to the UCS-2LE encoding. I have no idea if this can be fixed or whether we should really care, but technically I can create a username and password on Windows with these characters in the value and authenticate with them using NTLM.
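For context, here is a minimal plain-Python sketch (nothing gss-ntlmssp specific) of why a UCS-2 conversion cannot work here: U+1D11E lies outside the Basic Multilingual Plane, so UTF-16 needs a surrogate pair for it and UCS-2 has no representation for it at all.

ch = "\U0001D11E"               # 𝄞
utf16 = ch.encode('utf-16-le')  # encodes as the surrogate pair D834 DD1E
print(utf16.hex())              # 34d81edd
print(ord(ch) > 0xFFFF)         # True: outside the 16-bit UCS-2 range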

To replicate this problem run

import gssapi

# The NTLM mechanism OID
ntlm = gssapi.OID.from_int_seq('1.3.6.1.4.1.311.2.2.10')

username = b"User\xF0\x9D\x84\x9E".decode('utf-8')  # "User" + 𝄞 (U+1D11E)
password = b"Pass\xF0\x9D\x84\x9E"                  # same char, raw UTF-8 bytes
username = gssapi.Name(username, name_type=gssapi.NameType.user)

# Fails with the "Crypto routine failure" error above
cred = gssapi.raw.acquire_cred_with_password(username, password, usage='initiate', mechs=[ntlm])

It seems like this step, and the same one for NTOWFv1, is where the problem occurs, but I don't fully understand how libunistring works, so I can't tell whether there is a workaround or whether this should be raised there instead.
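For reference, here is a rough Python sketch of the two one-way functions as MS-NLMP defines them (an illustration of the algorithm, not the gss-ntlmssp code): the password, and for v2 the uppercased user plus domain, have to be converted to UTF-16LE before hashing, and that conversion is the step that fails here.

import hashlib, hmac

def ntowf_v1(password: str) -> bytes:
    # NTOWFv1 per MS-NLMP: MD4 over the UTF-16LE-encoded password.
    # (MD4 availability in hashlib depends on the OpenSSL build.)
    return hashlib.new('md4', password.encode('utf-16-le')).digest()

def ntowf_v2(password: str, user: str, domain: str) -> bytes:
    # NTOWFv2 per MS-NLMP: HMAC-MD5 keyed with NTOWFv1 of the password,
    # over the UTF-16LE uppercased user concatenated with the domain.
    key = ntowf_v1(password)
    return hmac.new(key, (user.upper() + domain).encode('utf-16-le'), hashlib.md5).digest()

Note that Python's UTF-16LE encoder handles the surrogate-pair character fine; the failure is specific to a converter that is limited to UCS-2.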

Even if we get past this and avoid these characters in the password, the code also fails when generating or parsing an authenticate message with a username like this. I haven't looked into the code to see what causes that, but I would guess it's a similar situation to the password.

If you wish to try to fix this, I'm happy to supply a way to set up a local user on Windows with a char that becomes a surrogate pair, as I've tested this out with a Python NTLM implementation I have.

simo5 commented 4 years ago

I think this is because UCS-2LE simply cannot represent surrogate pairs. I guess we'll have to move to UTF-16, which seems to be what Microsoft moved to in their OSs to handle Unicode characters once the set grew past what UCS-2 could handle.
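As a quick illustration (plain Python, independent of libunistring): read as UCS-2, i.e. one character per 2-byte unit, the four bytes of 𝄞 decode to two lone surrogate code units, which are not valid characters on their own, so a strict UCS-2 converter has to reject them.

data = "\U0001D11E".encode('utf-16-le')
units = [int.from_bytes(data[i:i + 2], 'little') for i in range(0, len(data), 2)]
print([hex(u) for u in units])  # ['0xd834', '0xdd1e'] -- lone surrogates, invalid in UCS-2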

simo5 commented 4 years ago

The difficulty in this is changing the code in various places, since the generated UTF-8 string can now be much bigger: the UTF-8 representation can be longer than the string's length times 2 ...

simo5 commented 4 years ago

In fact I think (hope) the worst case is len(utf8 string) = 3 * len(utf16 string)

simo5 commented 4 years ago

Actually, a UTF-8 string is never more than twice a UTF-16 string in size when counting bytes. Counting code units, UTF-8 sometimes needs 3 bytes to represent what UTF-16 can represent with a single 2-byte code unit, but surrogate pairs in UTF-16 are always represented with at most 4 bytes in UTF-8 as well.
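A quick plain-Python check of the byte sizes across the relevant ranges backs this up: the worst case for UTF-8 relative to UTF-16 is 3 bytes against one 2-byte code unit (1.5x), and supplementary-plane characters take 4 bytes in both encodings. The sample characters are arbitrary.

for cp in (0x41, 0x3C0, 0x4E2D, 0x1D11E):  # 'A', 'π', '中', '𝄞'
    ch = chr(cp)
    u8, u16 = ch.encode('utf-8'), ch.encode('utf-16-le')
    print(f"U+{cp:05X}: utf8={len(u8)} utf16={len(u16)} ratio={len(u8) / len(u16):.2f}")

This prints ratios 0.50, 1.00, 1.50 and 1.00, so the maximum is 1.5, comfortably below 2.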