unicodedata_UCD_lookup() has theoretical buffer overflow

tiran commented 9 years ago

BPO	23997
Nosy	@malemburg, @pitrou, @vstinner, @tiran, @benjaminp, @ezio-melotti, @serhiy-storchaka
Files	unicode_name_maxlen.patch unicode_name_maxlen_trunc.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at = None
created_at =
labels = ['extension-modules', 'type-bug']
title = 'unicodedata_UCD_lookup() has theoretical buffer overflow'
updated_at =
user = 'https://github.com/tiran'
```

bugs.python.org fields:

```python
activity =
actor = 'serhiy.storchaka'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Extension Modules']
creation =
creator = 'christian.heimes'
dependencies = []
files = ['39109', '41365']
hgrepos = []
issue_num = 23997
keywords = ['patch']
message_count = 2.0
messages = ['241461', '256744']
nosy_count = 7.0
nosy_names = ['lemburg', 'pitrou', 'vstinner', 'christian.heimes', 'benjamin.peterson', 'ezio.melotti', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue23997'
versions = ['Python 2.7', 'Python 3.5', 'Python 3.6']
```

tiran commented 9 years ago

Coverity has found a potential buffer overflow in the unicodedata module. unicodedata_UCD_lookup() calls _getcode(), which calls _cmpname(). _cmpname() copies data into a fixed-size buffer of length NAME_MAXLEN. Neither lookup() nor _getcode() limits name_length to NAME_MAXLEN, so the buffer could theoretically overflow.

In practice the buffer overflow can't be abused because Tools/unicode/makeunicodedata.py already limits the maximum name length. I'd still like to fix the bug because it is low-hanging fruit. In most versions of Python the code already checks that name_length fits in INT_MAX.

CID 1295028 (#1 of 1): Out-of-bounds access (OVERRUN) overrun-call: Overrunning callee's array of size 256 by passing argument (int)name_length (which evaluates to 2147483647) in call to _getcode
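The missing check described above can be modelled in pure Python. This is only a sketch of the idea, not the C patch attached to this issue: `lookup_checked` is a hypothetical helper, and `NAME_MAXLEN = 256` is an assumption taken from the array size in the Coverity report.

```python
import unicodedata

# Assumption: 256 mirrors the "array of size 256" in the Coverity report.
NAME_MAXLEN = 256


def lookup_checked(name: str) -> str:
    """Hypothetical wrapper: reject over-long names before the real lookup.

    The C code works on the UTF-8 bytes of the name, so the limit is
    applied to the encoded length rather than the number of characters.
    """
    encoded = name.encode("utf-8")
    if len(encoded) > NAME_MAXLEN:
        # No valid character name can be this long, so fail early
        # instead of passing the length on to the lookup code.
        raise KeyError("undefined character name (too long)")
    return unicodedata.lookup(name)


if __name__ == "__main__":
    print(lookup_checked("LATIN SMALL LETTER A"))
```

Rejecting the name up front means the lookup code never sees a length larger than the buffer it copies into, which is the invariant the Coverity report says is not enforced.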

serhiy-storchaka commented 8 years ago

Currently the error message virtually always contains the name (unless its UTF-8 representation is longer than INT_MAX bytes). With unicode_name_maxlen.patch, the message no longer contains the name when the name is a few hundred, or even a few tens of, characters long.

The proposed patch makes the error message always contain the name, truncated to NAME_MAXLEN bytes.

>>> name = ''.join(map(chr, range(0x2c80, 0x2ce4)))
>>> unicodedata.lookup(name)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: "undefined character name 'ⲀⲁⲂⲃⲄⲅⲆⲇⲈⲉⲊⲋⲌⲍⲎⲏⲐⲑⲒⲓⲔⲕⲖⲗⲘⲙⲚⲛⲜⲝⲞⲟⲠⲡⲢⲣⲤⲥⲦⲧⲨⲩⲪⲫⲬⲭⲮⲯⲰⲱⲲⲳⲴⲵⲶⲷⲸⲹⲺⲻⲼⲽⲾⲿⳀⳁⳂⳃⳄⳅⳆⳇⳈⳉⳊⳋⳌⳍⳎⳏⳐⳑⳒⳓⳔ�...'"
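The trailing replacement character in that message is what byte-level truncation would be expected to produce. The sketch below reproduces the shape of the output in pure Python; it is not the patch itself, `truncate_name` is a hypothetical helper, and `NAME_MAXLEN = 256` is assumed from the Coverity report.

```python
# Assumption: 256 mirrors the array size quoted in the Coverity report.
NAME_MAXLEN = 256


def truncate_name(name: str, limit: int = NAME_MAXLEN) -> str:
    """Hypothetical helper: cut the UTF-8 form of `name` at `limit` bytes.

    A cut that lands inside a multi-byte sequence leaves an incomplete
    character, which decodes to U+FFFD when errors="replace" is used.
    """
    encoded = name.encode("utf-8")
    if len(encoded) <= limit:
        return name
    return encoded[:limit].decode("utf-8", errors="replace") + "..."


if __name__ == "__main__":
    # Same name as in the session above: 100 characters, 3 bytes each.
    name = "".join(map(chr, range(0x2C80, 0x2CE4)))
    print(truncate_name(name))
```

Each character in this range takes three bytes in UTF-8, so 256 bytes hold 85 whole characters plus the first byte of the 86th; that dangling byte becomes the replacement character just before the ellipsis.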

unicodedata_UCD_lookup() has theoretical buffer overflow #68185