python / cpython

The Python programming language
https://www.python.org

Hash collision security issue #57912

Closed: warsaw closed this issue 12 years ago

warsaw commented 12 years ago
BPO 13703
Nosy @malemburg, @gvanrossum, @tim-one, @loewis, @warsaw, @birkenfeld, @terryjreedy, @gpshead, @jcea, @mdickinson, @pitrou, @tiran, @benjaminp, @serwy, @merwok, @alex, @skrah, @davidmalcolm, @bz2, @markshannon, @ericsnowcurrently, @jimjjewett, @PaulMcMillan
Dependencies
  • bpo-13704: Random number generator in Python core
Files
  • hash-attack.patch
  • SafeDict.py: SafeDict implementation
  • bench_startup.py
  • random-8.patch
  • hash-collision-counting-dmalcolm-2012-01-20-001.patch
  • amortized-probe-counting-dmalcolm-2012-01-20-002.patch
  • amortized-probe-counting-dmalcolm-2012-01-21-003.patch
  • hash-attack-2.patch
  • hash-attack-3.patch
  • integercollision.py
  • backport-of-hash-randomization-to-2.7-dmalcolm-2012-01-23-001.patch: Backport of haypo's random-8.patch to 2.7
  • hybrid-approach-dmalcolm-2012-01-25-001.patch: Hybrid approach to solving dict DoS attack
  • hybrid-approach-dmalcolm-2012-01-25-002.patch
  • optin-hash-randomization-for-3.1-dmalcolm-2012-01-27-001.patch
  • optin-hash-randomization-for-3.1-dmalcolm-2012-01-28-001.patch
  • optin-hash-randomization-for-3.1-dmalcolm-2012-01-29-001.patch
  • unnamed
  • optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-001.patch
  • optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-002.patch
  • optin-hash-randomization-for-2.6-dmalcolm-2012-01-30-001.patch
  • results-16.txt
  • add-randomization-to-2.6-dmalcolm-2012-02-01-001.patch
  • fix-broken-tests-on-2.6-dmalcolm-2012-02-01-001.patch
  • add-randomization-to-3.1-dmalcolm-2012-02-01-001.patch
  • fix-broken-tests-on-3.1-dmalcolm-2012-02-01-001.patch
  • add-randomization-to-2.6-dmalcolm-2012-02-06-001.patch
  • fix-broken-tests-on-2.6-dmalcolm-2012-02-06-001.patch
  • add-randomization-to-3.1-dmalcolm-2012-02-06-001.patch
  • fix-broken-tests-on-3.1-dmalcolm-2012-02-06-001.patch
  • add-randomization-to-2.6-dmalcolm-2012-02-11-001.patch
  • add-randomization-to-3.1-dmalcolm-2012-02-11-001.patch
  • add-randomization-to-2.6-dmalcolm-2012-02-13-001.patch
  • add-randomization-to-3.1-dmalcolm-2012-02-13-001.patch
  • hash-patch-3.1-gb-03.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields: labels = ['type-security', 'interpreter-core', 'release-blocker', '3.11']; title = 'Hash collision security issue'; user = 'https://github.com/warsaw'. bugs.python.org fields: issue_num = 13703; creator = 'barry'; closer = 'gregory.p.smith'; dependencies = ['13704']; message_count = 328; priority = 'release blocker'; resolution = 'fixed'; status = 'closed'; type = 'security'; url = 'https://bugs.python.org/issue13703'

    warsaw commented 12 years ago

    This is already publicly known and in deep discussion on python-dev. The proper fix is still TBD. Essentially, hash collisions can be exploited to DoS a web framework that automatically parses input forms into dictionaries.

    Start here:

    http://mail.python.org/pipermail/python-dev/2011-December/115116.html
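
    For illustration, a minimal sketch of why colliding keys are so expensive. A custom __hash__ simulates attacker-crafted colliding strings here; the real attack needs no such cooperation from the target:

    import time

    class Colliding(str):
        def __hash__(self):
            return 42  # every key lands in the same probe chain

    def time_inserts(n, make_key):
        keys = [make_key("k%d" % i) for i in range(n)]
        start = time.time()
        d = {}
        for k in keys:
            d[k] = None
        return time.time() - start

    for n in (1000, 2000, 4000):
        # normal keys scale roughly linearly; colliding keys quadratically
        print(n, time_inserts(n, str), time_inserts(n, Colliding))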

    tiran commented 12 years ago

    I had a short chat with Guido yesterday. I'll try to sum up the conversation. Guido, please correct me if I got something wrong or missed a point.

    Guido wants the fix to be as simple and unintrusive as possible, since he wants to provide/apply a patch for Python 2.4 through 3.3. This means any new stuff is off the table unless it's really, really necessary. Say goodbye to my experimental MurmurHash3 patch.

    We haven't agreed whether the randomization should be enabled by default or disabled by default. IMHO it should be disabled for all releases except for the upcoming 3.3 release. The env var PYTHONRANDOMHASH=1 would enable the randomization. It's simple to set the env var in e.g. Apache for mod_python and mod_wsgi.

    pitrou commented 12 years ago

    We haven't agreed whether the randomization should be enabled by default or disabled by default. IMHO it should be disabled for all releases except for the upcoming 3.3 release.

    I think on the contrary it must be enabled by default. Leaving security holes open is wrong.

    tiran commented 12 years ago

    I think on the contrary it must be enabled by default. Leaving security holes open is wrong.

    We can't foresee the implications of the randomization, and only a small number of deployments are affected by the problem. But I won't start a fight on the matter. ;)

    gvanrossum commented 12 years ago

    I'm with Antoine -- turn it on by default. Maybe there should be a release candidate to test the waters.

    warsaw commented 12 years ago

    On Jan 03, 2012, at 08:24 PM, Antoine Pitrou wrote:

    I think on the contrary it must be enabled by default. Leaving security holes open is wrong.

    Unless there's evidence of performance regressions or backward incompatibilities, I agree.

    vstinner commented 12 years ago

    Unless there's evidence of performance regressions or backward incompatibilities, I agree.

    If hash() is modified, the output of str(dict) and str(set) will change, for example. That may break doctests. Can we take the position that applications should not rely (even indirectly) on hash values, and should therefore fix their doctests, for example? Or does that count as a backward incompatibility?

    hash() was already modified in major Python versions.

    For this specific issue, I consider that security is more important than str(dict).
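
    A hypothetical example of the doctest hazard (dicts of this era are ordered by hash, unlike the insertion-ordered dicts of CPython 3.7+):

    def parse_pairs():
        """
        >>> parse_pairs()        # passes only while the repr puts 'a' first
        {'a': 1, 'b': 2}
        """
        return {'b': 2, 'a': 1}

    # A randomization-proof doctest compares a canonical form instead:
    #     >>> sorted(parse_pairs().items())
    #     [('a', 1), ('b', 2)]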

    benjaminp commented 12 years ago

    Barry, when this gets fixed, shall we coordinate release times?

    warsaw commented 12 years ago

    On Jan 03, 2012, at 09:43 PM, Benjamin Peterson wrote:

    Barry, when this gets fixed, shall we coordinate release times?

    Yes!

    tiran commented 12 years ago

    Randomized hashing destabilizes the unit tests of Python, too. Here are the outputs of four test runs:

    11 tests failed: test_collections test_dbm test_dis test_gdb test_inspect test_packaging test_set test_symtable test_ttk_textonly test_urllib test_urlparse

    9 tests failed: test_dbm test_dis test_gdb test_json test_packaging test_set test_symtable test_urllib test_urlparse

    10 tests failed: test_dbm test_dis test_gdb test_inspect test_packaging test_set test_symtable test_ttk_textonly test_urllib test_urlparse

    9 tests failed: test_collections test_dbm test_dict test_dis test_gdb test_packaging test_symtable test_urllib test_urlparse

    PaulMcMillan commented 12 years ago

    I agree that we should enable randomness by default, and provide an easy way for users to disable it if necessary (unit test suites that explicitly depend on order being an obvious candidate).

    I'll link my proposed algorithm change here, for the record: https://gist.github.com/0a91e52efa74f61858b5

    I've gotten confirmation from several other sources that the fix recommended by the presenters (just a random initialization seed) only prevents the most basic form of the attack.

    vstinner commented 12 years ago

    Christian Heimes proposes the following change in his randomhash branch (see issue bpo-13704):

    -    x = (Py_uhash_t) *p << 7;
    +    x = Py_RndHashSeed + ((Py_uhash_t) *p << 7);
         for (i = 0; i < len; i++)
             x = (1000003U * x) ^ (Py_uhash_t) *p++;
         x ^= (Py_uhash_t) len;

    This change doesn't add any security if the attacker can inject any string and retrieve the hash value. You can retrieve Py_RndHashSeed directly using:

    Py_RndHashSeed = intmask((hash("a") ^ len("a") ^ ord("a")) * DIVIDE) - (ord("a") << 7)

    where intmask() truncates to a long (x mod 2^(long bits)) and DIVIDE = 1/1000003 mod 2^(long bits). For example, DIVIDE=2021759595 for 32 bits long.
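
    The DIVIDE constant can be derived mechanically, since 1000003 is odd and so invertible modulo any power of two. A sketch (the three-argument pow needs Python 3.8+; use the extended Euclidean algorithm otherwise):

    bits = 32                                  # use 64 for a 64-bit long
    DIVIDE = pow(1000003, -1, 2**bits)         # modular inverse of the multiplier
    assert DIVIDE == 2021759595                # matches the value quoted above
    assert (1000003 * DIVIDE) % 2**bits == 1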

    tiran commented 12 years ago

    Victor, please ignore my code related to hash randomization for now. I've deliberately not linked my branch to this bug report. I'm well aware that it's not secure and that it's pretty easy to reverse engineer the seed from a hash of a short string. The code is a proof of concept to detect failing tests and other issues.

    I'm in private contact with Paul and we are working together. He has done extended research and I'll gladly follow his expertise. I've already discussed the issue with small strings, but I can't recall if it was a private mail to Paul or a public one to the dev list.

    Paul: I still think that you should special-case short strings (five or fewer chars sounds good). An attacker can't do much harm with one to five char strings, but such short strings may make it too easy to calculate the seed.

    16 KB of seed is still a lot. Most CPUs have about 16 to 32, maybe 64 KB of L1 data cache. 1024 to 4096 bytes should increase cache locality and reduce the speed impact.

    PS: I'm going to reply to your last mail tomorrow.

    terryjreedy commented 12 years ago

    In bpo-13707 I suggest a change to the current hash() entry which is needed independently of this issue, because the default hash (for object()), being tied to id(), is already limited to an object's lifetime. But this change will become more imperative if hash() is made run-dependent for numbers and strings.

    There does not presently seem to *be* a security hole for 64-bit builds, so if there is any noticeable slowdown on 64-bit builds, and it is reasonably easy to tie the default to the bitness, I would think it should be off for such builds.

    vstinner commented 12 years ago

    Paul first proposition (on python-dev) was to replace:

    ...
    x = (ord(s[0]) << 7)
    while i < length:
        x = intmask((1000003*x) ^ ord(s[i]))
        ...

    by:

    ...
    x = (ord(s[0]) << 7)
    while i < length:
        x = intmask((1000003*x) ^ ord(s[i])) ^ r[x % len_r]
        ...

    This change has a vulnerability similar to the one in Christian's suggested change. The "r" array can be retrieved directly with:

    r2 = []
    for i in xrange(len(r)):
        s = chr(intmask(i * UNSHIFT7) % len(r))
        h = intmask(hash(s) ^ len(s) ^ ord(s) ^ ((ord(s) << 7) * MOD))
        r2.append(chr(h))
    r2 = ''.join(r2)

    where UNSHIFT7 = 1/2**7 mod 2^(long bits).

    By the way, this change always uses r[0] to hash any string of one ASCII character (U+0000-U+007F).

    pitrou commented 12 years ago

    I'm in private contact with Paul and we are working together. He has done extended research and I'll gladly follow his expertise. I've already discussed the issue with small strings, but I can't recall if it was a private mail to Paul or a public one to the dev list.

    Can all this be discussed on this issue now that it's the official point of reference? It will avoid the repetition of arguments we see here and there.

    (I don't think special-casing small strings makes sense, because then you have two algorithms to audit rather than one)

    vstinner commented 12 years ago

    https://gist.github.com/0a91e52efa74f61858b5

    Please attach the file directly to the issue, or copy/paste the code into your comment. The interesting part of the code:

    Proposed replacement

    --------------------------------------

    import os, array
    size_exponent = 14 #adjust as a memory/security tradeoff
    r = array.array('l', os.urandom(2**size_exponent))
    len_r = len(r)
    
    def _hash_string2(s):
        """The algorithm behind compute_hash() for a string or a unicode."""
        length = len(s)
        #print s
        if length == 0:
            return -1
        x = (ord(s[0]) << 7) ^ r[length % len_r]
        i = 0
        while i < length:
            x = intmask((1000003*x) ^ ord(s[i]))
            x ^= r[x % len_r]
            i += 1
        x ^= length
        return intmask(x)

    r = array.array('l', os.urandom(2**size_exponent))
    len_r = len(r)

    r size should not depend on the size of a long. You should write something like:

    sizeof_long = ctypes.sizeof(ctypes.c_long)
    r_bits = 8
    r = array.array('l', os.urandom((2**r_bits) * sizeof_long))
    r_mask = 2**r_bits-1

    and then replace "% len_r" by "& r_mask".

    What is the minimum value of r_bits? For example, would it be safe to use a single long integer? (r_bits=1)

    pitrou commented 12 years ago

    > r = array.array('l', os.urandom(2**size_exponent))
    > len_r = len(r)

    r size should not depend on the size of a long. You should write something like:

    sizeof_long = ctypes.sizeof(ctypes.c_long)
    r_bits = 8
    r = array.array('l', os.urandom((2**r_bits) * sizeof_long))
    r_mask = 2**r_bits - 1

    The final code will be in C and will use neither ctypes nor array.array. Arguing about this looks quite pointless IMO.

    pitrou commented 12 years ago

    For the record, here is what "man urandom" says about random seed size:

    “[...] no cryptographic primitive available today can hope to promise more than 256 bits of security, so if any program reads more than 256 bits (32 bytes) from the kernel random pool per invocation, or per reasonable reseed interval (not less than one minute), that should be taken as a sign that its cryptography is not skilfully implemented.”

    In that light, reading a 64 bytes seed from /dev/urandom is already a lot, and 4096 bytes is simply insane.

    vstinner commented 12 years ago

    I read that the attack cannot be carried out with today's computers (it's too expensive) against 64-bit Python. I tried to change str.__hash__ in 32-bit Python to compute the hash in 64 bits and then truncate the hash to 32 bits: it doesn't change anything, the hash values are the same, so it doesn't improve security. (Unsurprising: multiplication and XOR commute with truncation modulo a power of two, so the low 32 bits come out identical.)

    vstinner commented 12 years ago

    Yet another random hash function, a simplified version of Paul's function. It always uses exactly 256 bits of entropy, and so 32 bytes of memory, and it keeps the r lookups out of the per-character loop. I don't expect my function to be secure, just to give the attacker more work to compute the data for an attack against our dict implementation.

    ---

    import os, array, struct
    sizeof_long = struct.calcsize("l")
    r_bits = 256
    r_len = r_bits // (sizeof_long * 8)
    r_mask = r_len - 1
    r = array.array('l', os.urandom(r_len * sizeof_long))
    
    def randomhash(s):
        length = len(s)
        if length == 0:
            return -2
        x = ord(s[0])
        x ^= r[x & r_mask]
        x <<= 7
        for ch in s:
            x = intmask(1000003 * x)
            x ^= ord(ch)
        x ^= length
        x ^= r[x & r_mask]
        return intmask(x)

    The first "x ^= r[x & r_mask]" may be replaced by "x ^= r[(x ^ length) & r_mask]".

    The binary shift is done after the first XOR with r because 2**7 and r_len are not coprime (2**7 is a multiple of r_len), and so (ord(s[0]) << 7) & r_mask is always zero.

    randomhash(s) == hash(s) if we happen to use the same index into the r array twice. I don't know if this case gives useful information.

    PaulMcMillan commented 12 years ago

    A couple of things here:

    First, my proposed change is not cryptographically secure. There simply aren't any cryptographic hashing algorithms available that are in the performance class we need. My proposal does make the collision attack quite difficult to carry out, even if the raw output values from the hash are available to the attacker (they should not usually be).

    I favor using the same algorithm on 32-bit and 64-bit builds for consistency of behavior. Developers would be startled to find that ordering stays consistent on a 64-bit build but varies on 32-bit builds. Additionally, the impracticality of attacking 64-bit builds rests on the fact that these particular researchers didn't devise a way to do it. I'd hate to make this change and then have a clever mathematician publish some elegant method requiring us to go fix the problem all over again.

    I could be convinced either way on small strings. I like that they can't be used to attack the secret. At the same time, I worry that combining 2 different hashing routines into the same output space may introduce unexpected collisions and other difficult-to-debug edge cases. It also means that the order of the hashes of long strings will vary while the order of short strings will not, another inconsistency which will encourage bugs.

    Thank you Victor for the improvements to the python demonstration code. As Antoine said, it's only demo code, but better demo code is always good.

    Antoine: That section of the manpage is referring to the overall security of a key generated using urandom. 256 bits is overkill for this application. We could take 256 bits and use them to generate a key using a cryptographically appropriate algorithm, but it's simpler to read more bits and use them directly as the key.

    Additionally, that verbiage has been in the man page for urandom for quite some time (probably since the earliest version in the mid 90's). The PRNG has been improved since then.

    Minimum length of r is a hard question. The shorter it is, the higher the correlation of the output. In my tests, 16kb was the amount necessary to generally do reasonably well on my test suite for randomness even with problematic input. Obviously our existing output isn't random, so it doesn't pass those tests at all. Using a fairly small value (4k) should not make the results much worse from a security perspective, but might be problematic from a collision/distribution standpoint. It's clear that we don't need cryptographically good randomness here, but passing the test suite is not a bad thing when considering the distribution.

    When we settle on a C implementation, I'd like to run it through the smhasher set of tests to make sure we aren't making distribution worse, especially for very small values of r.

    pitrou commented 12 years ago

    Using a fairly small value (4k) should not make the results much worse from a security perspective, but might be problematic from a collision/distribution standpoint.

    Keep in mind the average L1 data cache size is between 16KB and 64KB. 4KB is already a significant chunk of that.

    Given a hash function's typical loop is to feed back the current result into the next computation, I don't see why a small value (e.g. 256 bytes) would be detrimental.

    merwok commented 12 years ago

    If test_packaging fails because it relies on dict order / hash details, that’s a bug. Can you copy the full tb (possibly in another report, I can fix it independently of this issue)?

    warsaw commented 12 years ago

    On Jan 04, 2012, at 06:00 AM, Paul McMillan wrote:

    Developers would be startled to find that ordering stays consistent on a 64 bit build but varies on 32 bit builds.

    Well, one positive outcome of this issue is that users will finally viscerally understand that dictionary (and set) order should never be relied upon, even between successive runs of the same Python executable.

    malemburg commented 12 years ago

    Some comments:

    1. The security implications of all this are being somewhat overemphasized.

    There are many ways you can do a DoS attack on web servers. It's the responsibility of the used web frameworks and servers to deal with the possible cases.

    It's a good idea to provide some way to protect against hash collision attacks, but that will only solve one possible way of causing a resource attack on a server.

    There are other ways you can generate lots of CPU overhead with little data input (e.g. think of targeting the search feature on many Zope/Plone sites).

    In order to protect against such attacks in general, we'd have to provide a way to control CPU time and e.g. raise an exception if too much time is being spent on a simple operation such as a key insertion. This can be done using timers, signals or even under OS control.
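
    A rough sketch of that timer/signal idea (Unix-only, illustrative rather than part of any proposed patch; ITIMER_PROF counts CPU time rather than wall time and delivers SIGPROF when the budget is spent):

    import signal

    class CpuBudgetExceeded(Exception):
        pass

    def _raise_timeout(signum, frame):
        raise CpuBudgetExceeded("operation exceeded its CPU budget")

    def run_with_cpu_budget(seconds, func, *args):
        old = signal.signal(signal.SIGPROF, _raise_timeout)
        signal.setitimer(signal.ITIMER_PROF, seconds)
        try:
            return func(*args)
        finally:
            signal.setitimer(signal.ITIMER_PROF, 0)  # disarm the timer
            signal.signal(signal.SIGPROF, old)

    # e.g. run_with_cpu_budget(0.5, dict, untrusted_pairs)  (untrusted_pairs is hypothetical)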

    The easiest way to protect against the hash collision attack is by limiting the POST/GET/HEAD request size.

    The second best way would be to limit the number of parameters that a web framework accepts for POST/GET/HEAD request.

    2. Changing the semantics of hashing in a dot release is not allowed.

    If randomization of the hash start vector or some other method is enabled by default in a dot release, this will change the semantics of any application switching to that dot release.

    The hash values of Python objects are not only used by the Python dictionary implementation, but also by other storage mechanisms such as on-disk dictionaries, inter-process object exchange via share memory, memcache, etc.

    Hence, if changed, the hash change should be disabled per default for dot releases and enabled for 3.3.

    3. Changing the way strings are hashed doesn't solve the problem.

    Hash values of other types can easily be guessed as well, e.g. take integers which use a trivial hash function.

    We'd have to adapt all hash functions of the basic types in Python or come up with a generic solution using e.g. double-hashing in the dictionary/set implementations.

    4. By just using a random start vector you change the absolute hash values for specific objects, but not the overall hash sequence or its period.

    An attacker only needs to create many hash collisions, not specific ones. It's the period of the hash function that's important in such attacks and that doesn't change when moving to a different start vector.

    5. Hashing needs to be fast.

    It's one of the most used operations in Python. Please get experts into the boat like Tim Peters and Christian Tismer, who both have worked on the dict implementation and the hash functions, before experimenting with ad-hoc fixes.

    6. Counting collisions could solve the issue without having to change hashing.

    Another idea would be counting the collisions and raising an exception if the number of collisions exceed a certain threshold.

    Such a change would work for all hashable Python objects and protect against the attack without changing any hash function.
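
    A toy sketch of the counting idea (not the actual patch; CPython's dict uses open addressing, crudely imitated here with linear probing, no resizing, and an illustrative threshold):

    MAX_COLLISIONS = 1000

    class CountingTable:
        def __init__(self, size=1 << 16):
            self.slots = [None] * size      # size must stay a power of two

        def insert(self, key, value):
            mask = len(self.slots) - 1
            i = hash(key) & mask
            collisions = 0
            while self.slots[i] is not None and self.slots[i][0] != key:
                collisions += 1
                if collisions > MAX_COLLISIONS:
                    raise RuntimeError("too many hash collisions; possible attack")
                i = (i + 1) & mask          # probe the next slot
            self.slots[i] = (key, value)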

    Thanks, -- Marc-Andre Lemburg eGenix.com



    malemburg commented 12 years ago

    Marc-Andre Lemburg wrote:

    3. Changing the way strings are hashed doesn't solve the problem.

    Hash values of other types can easily be guessed as well, e.g. take integers which use a trivial hash function.

    Here's an example for integers on a 64-bit machine:

    >>> g = ((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 1000000))
    >>> d = dict(g)

    This takes ages to complete and only uses very little memory. The input data is some 32 MB if written down in decimal numbers: 32397634 bytes. (Every multiple of 2**64 - 1 collides, because Python 2's long hash on a 64-bit machine effectively reduces modulo 2**64 - 1.)

    malemburg commented 12 years ago

    The email interface ate part of my reply:

    >>> g = ((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 1000000))
    >>> s = ''.join(str(x) for x in g)
    >>> len(s)
    32397634
    >>> g = ((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 1000000))
    >>> d = dict(g)
    ... lots of time for coffee, pizza, taking a walk, etc. :-)

    terryjreedy commented 12 years ago

    To expand on Marc-Andre's point 1: the DoS attack on web servers is possible because servers are generally dumb at the first stage. Upon receiving a POST request, all key=value pairs are mindlessly packaged into a hash table that is then passed on to a page handler that typically ignores the invalid keys.

    However, most pages do not need any key/value pairs at all, and forms that do need them have a pre-defined set of expected and recognized keys. If there were a possibly empty set of keys associated with each page, and the set were checked against posted keys, then a DoS post with thousands of effectively random keys could quickly (in O(1) time) be rejected as erroneous.

    In Python, the same effect could be accomplished by associating a class with slots with each page and having the server create an instance of the class. Attempts to create an undefined attribute would then raise an exception. Either way, checking input data for face validity before processing it in a time-consuming way is one possible solution for nearly all web pages and at least some other applications.
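
    A sketch of that slots-based validation (the class and helper names are made up for illustration):

    class LoginForm(object):
        __slots__ = ('username', 'password')

    def bind(form_cls, pairs):
        form = form_cls()
        for key, value in pairs:
            setattr(form, key, value)  # undefined attribute -> AttributeError
        return form

    bind(LoginForm, [('username', 'bob'), ('password', 'x')])        # accepted
    # bind(LoginForm, [('junk%d' % i, '') for i in range(10**6)])    # rejected at the first unknown key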

    alex commented 12 years ago

    Except, it's a totally non-scalable approach. People have vulnerabilities all over their sites which they don't realize. Some examples:

    django-taggit (an application I wrote for handling tags) parses tags out of its input and stores them in a set to check for duplicates. It's vulnerable.

    Another site I'm writing accepts JSON POSTs, you can put arbitrary keys in the JSON. It's vulnerable.

    malemburg commented 12 years ago

    Marc-Andre Lemburg wrote:

    1. The security implications in all this is being somewhat overemphasized.

    There are many ways you can do a DoS attack on web servers. It's the responsibility of the used web frameworks and servers to deal with the possible cases.

    It's a good idea to provide some way to protect against hash collision attacks, but that will only solve one possible way of causing a resource attack on a server.

    There are other ways you can generate lots of CPU overhead with little data input (e.g. think of targeting the search feature on many Zope/Plone sites).

    In order to protect against such attacks in general, we'd have to provide a way to control CPU time and e.g. raise an exception if too much time is being spent on a simple operation such as a key insertion. This can be done using timers, signals or even under OS control.

    The easiest way to protect against the hash collision attack is by limiting the POST/GET/HEAD request size.

    For GET and HEAD, web servers normally already apply such limitations at rather low levels:

    http://stackoverflow.com/questions/686217/maximum-on-http-header-values

    So only HTTP methods which carry data in the body part of the HTTP request are affected, e.g. POST and various WebDAV methods.

    The second best way would be to limit the number of parameters that a web framework accepts for POST/GET/HEAD request.

    Depending on how parsers are implemented, applications taking XML/JSON/XML-RPC/etc. as data input may also be vulnerable, e.g. non validating XML parsers which place element attributes into a dictionary or a JSON parser that has to read the JSON version of the dict I generated earlier on.

    vstinner commented 12 years ago

    Work-in-progress patch implementing my randomized hash function (random.patch):

    Notes:

    TODO:

    vstinner commented 12 years ago

    add PyOS_URandom() using CryptoGen, SSL (only on VMS!!) or /dev/urandom

    Oh, OpenSSL (RAND_pseudo_bytes) should be used on Windows, Linux, Mac OS X, etc. if OpenSSL is available. I was just too lazy to add a define or pyconfig.h option to indicate if OpenSSL is available or not. FYI RAND_pseudo_bytes() is now exposed in the ssl module of Python 3.3.

    with a fallback on a dummy LCG

    It's the linear congruential generator (LCG) used by Microsoft Visual C++ and PHP:

    x(n+1) = (x(n) * 214013 + 2531011) % 2^32

    I only use bits 23..16 (bits 15..0 are not really random).
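
    A Python transcription of that fallback generator as described (illustrative; the real code is in C):

    def lcg_bytes(seed, n):
        x = seed & 0xFFFFFFFF
        out = bytearray()
        for _ in range(n):
            x = (x * 214013 + 2531011) & 0xFFFFFFFF
            out.append((x >> 16) & 0xFF)   # keep bits 23..16; the low bits are weak
        return bytes(out)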

    pitrou commented 12 years ago

    > add PyOS_URandom() using CryptoGen, SSL (only on VMS!!) or /dev/urandom

    Oh, OpenSSL (RAND_pseudo_bytes) should be used on Windows, Linux, Mac OS X, etc. if OpenSSL is available.

    Apart from the large dependency, the OpenSSL license is not GPL-compatible which may be a problem for some Python-embedding applications: http://en.wikipedia.org/wiki/OpenSSL#Licensing

    > with a fallback on a dummy LCG

    It's the linear congruential generator (LCG) used by Microsoft Visual C++ and PHP:

    x(n+1) = (x(n) * 214013 + 2531011) % 2^32

    I only use bits 23..16 (bits 15..0 are not really random).

    If PHP uses it, I'm confident it is secure.

    vstinner commented 12 years ago

    + printf("read %i bytes\n", size);

    Oops, I forgot a debug message.

    vstinner commented 12 years ago

    If PHP uses it, I'm confident it is secure.

    If I remember correctly, it is only used for the Windows version of PHP, but PHP doesn't implement it correctly because it uses all bits.

    PaulMcMillan commented 12 years ago

    This is not something that can be fixed by limiting the size of POST/GET.

    Parsing documents (even offline) can generate these problems. I can create books that calibre (a Python-based ebook format-shifting tool) can't convert, but that are otherwise perfectly valid for non-Python devices. If I'm allowed to insert usernames into a database and you ever retrieve those in a dict, you're vulnerable. If I can post things one at a time that eventually get parsed into a dict (like the tag example), you're vulnerable. I can generate web traffic that creates log files that are unparsable (even offline) in Python if dicts are used anywhere. Any application that accepts data from users needs to be considered.

    Even if the web framework has a dictionary implementation that randomizes the hashes so it's not vulnerable, the entire python standard library uses dicts all over the place. If this is a problem which must be fixed by the framework, they must reinvent every standard library function they hope to use.

    Any non-trivial python application which parses data needs the fix. The entire standard library needs the fix if is to be relied upon by applications which accept data. It makes sense to fix Python.

    Of course we must fix all the basic hashing functions in python, not just the string hash. There aren't that many.

    Marc-Andre: If you look at my proposed code, you'll notice that we do more than simply shift the period of the hash. It's not trivial for an attacker to create collisions without knowing the key.

    Since speed is a concern, I think that the proposal to avoid using the random hash for short strings is a good idea. Additionally, randomizing only some of the characters in longer strings will allow us to improve security without compromising speed significantly.

    I suggest that we don't randomize strings shorter than 6 characters. For longer strings, we randomize the first and last 5 characters. This means we're only adding additional work to a max of 10 rounds of the hash, and only for longer strings. Collisions with the hash from short strings should be minimal.
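
    A hypothetical sketch of this proposal (the 5-character cutoffs are as proposed; the r table setup follows the earlier demo code, and everything else is illustrative):

    import os, array, struct

    sizeof_long = struct.calcsize('l')
    r_bits = 8
    r = array.array('l', os.urandom((2 ** r_bits) * sizeof_long))
    r_mask = 2 ** r_bits - 1
    MASK = 2 ** (sizeof_long * 8) - 1          # truncate like a C unsigned long

    def partially_randomized_hash(s):
        length = len(s)
        if length == 0:
            return -1
        x = (ord(s[0]) << 7) & MASK
        for i, ch in enumerate(s):
            x = ((1000003 * x) & MASK) ^ ord(ch)
            if i < 5 or i >= length - 5:       # randomize only the edges
                x ^= r[x & r_mask]
        return (x ^ length) & MASK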

    vstinner commented 12 years ago

    "Since speed is a concern, I think that the proposal to avoid using the random hash for short strings is a good idea."

    "Since speed is a concern, I think that the proposal to avoid using the random hash for short strings is a good idea."

    My proposition only adds two XORs to hash(str) (outside the loop on Unicode characters), so I expect a negligible overhead. I don't know yet how hard it is to guess the secret from hash(str) output.
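
    A rough 32-bit Python transcription of what such a prefix/suffix scheme looks like (a sketch of the idea as described in the surrounding messages, not the patch itself; the empty-string case is illustrative):

    import os, struct

    prefix, suffix = struct.unpack('<II', os.urandom(8))   # per-process secrets

    def randomized_hash(s):
        if not s:
            return 0
        x = (prefix ^ (ord(s[0]) << 7)) & 0xFFFFFFFF
        for ch in s:
            x = ((1000003 * x) & 0xFFFFFFFF) ^ ord(ch)
        x ^= len(s)
        return x ^ suffix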

    tiran commented 12 years ago

    Thanks Victor!

    • hash(str) is now randomized using two random Py_hash_t values: don't touch the critical loop, only add a prefix and a suffix

    At least for Python 2.x hash(str) and hash(unicode) have to yield the same result for ASCII only strings.

    • PyOS_URandom() raises exceptions even though it is called before the interpreter state is created. I suppose that it cannot work like this.

    My patch compensates for the issue and calls Py_FatalError() when the random seed hasn't been initialized yet.

    You aren't special-casing small strings. I fear that an attacker may guess the seed from several small strings. How about using another initial seed for strings shorter than 4 code points?

    pitrou commented 12 years ago

    You aren't special-casing small strings. I fear that an attacker may guess the seed from several small strings.

    How would (s)he do?

    PaulMcMillan commented 12 years ago

    My proposition only adds two XORs to hash(str) (outside the loop on Unicode characters), so I expect a negligible overhead. I don't know yet how hard it is to guess the secret from hash(str) output.

    It doesn't work much better than a single random seed. Calculating the hash of a null byte gives you the xor of your two seeds. An attacker can still cause collisions inside the vulnerable hash function, your change doesn't negate those internal collisions. Also, strings of all null bytes collide trivially.

    vstinner commented 12 years ago

    I fear that an attacker may guess the seed from several small strings

    hash(a) ^ hash(b) "removes" the suffix, but I don't see how to guess the prefix from this new value. It doesn't mean that it is not possible, just that I don't have a strong background in cryptography :-)

    I don't expect that adding 2 XORs would change our dummy (fast but unsafe) hash function into a cryptographic hash function. We cannot have security for free. If we want a strong cryptographic hash function, it will be much slower (Paul wrote that it would be 4x slower). But we prefer speed over security, so we have to compromise.

    I don't know if you can retrieve hash values in practice. I suppose that you can only get hash(str) & (size - 1), with size being the size of the dict's internal array, so only the lower bits. Using a large dict, you may be able to retrieve more bits of the hash value.

    tiran commented 12 years ago

    Given that a user has an application with an oracle function that returns the hash of a unicode string, an attacker can probe tens of thousands of one- and two-character unicode strings. That should give him/her enough data to calculate both seeds. hash("") already gives away lots of information about the seeds, too.

    PaulMcMillan commented 12 years ago

    • for small strings we could use a different seed than for larger strings

    Or just leave them unseeded with our existing algorithm. Shifting them into a different part of the hash space doesn't really gain us much.

    • for larger strings we could use Paul's algorithm but limit the XOR op to the first and last 16 elements instead of all elements.

    Agreed. It does have to be both the first and the last though. We can't just do one or the other.

    tiran commented 12 years ago

    Paul wrote:

    I suggest that we don't randomize strings shorter than 6 characters. For longer strings, we randomize the first and last 5 characters. This means we're only adding additional work to a max of 10 rounds of the hash, and only for longer strings. Collisions with the hash from short strings should be minimal.

    It's too surprising for developers when only strings with 6 or more chars are randomized. Barry made a good point: http://bugs.python.org/issue13703#msg150613

    vstinner commented 12 years ago

    "Calculating the hash of a null byte gives you the xor of your two seeds."

    Not directly, because the prefix is first multiplied by 1000003. So hash("\0") gives you ((prefix * 1000003) % 2^32) ^ suffix (the remaining XOR with the length, 1, accounts for the off-by-one in the session below).

    Example:

    $ ./python 
    secret={b7abfbbf, db6cbb4d}
    Python 3.3.0a0 (default:547e918d7bf5+, Jan  5 2012, 01:36:39) 
    >>> hash("")
    1824997618
    >>> hash("\0")
    -227042383
    >>> hash("\0"*2)
    1946249080
    >>> 0xb7abfbbf ^ 0xdb6cbb4d
    1824997618
    >>> (0xb7abfbbf * 1000003) & 0xffffffff ^ 0xdb6cbb4d
    4067924912
    >>> hash("\0") & 0xffffffff
    4067924913

    vstinner commented 12 years ago

    At least for Python 2.x hash(str) and hash(unicode) have to yield the same result for ASCII only strings.

    Ah yes, I forgot Python 2: I wrote my patch for Python 3.3. The two hash functions should be modified to be randomized.

    hash("") should always return 0

    Ok, I can add a special case. Antoine told me that hash("") gives prefix ^ suffix, which is too much information for the attacker :-)

    for small strings we could use a different seed than for larger strings

    Why? The attack doesn't work with short strings? What do you call a "short string"?

    vstinner commented 12 years ago

    Patch version 2:

    tiran commented 12 years ago

    In reply to MAL's message http://bugs.python.org/issue13703#msg150616

    1. Changing the semantics of hashing in a dot release is not allowed.

    I concur with Marc. The change is too intrusive and may cause too much trouble relative to the issue. Also, it seems to be unnecessary on platforms with a 64-bit hash.

    Marc: Fred told me that ZODB isn't affected. One thing less to worry. ;)

    1. Hashing needs to be fast.

    Good point. We should include Tim and Christian Tismer once we have a solution we can agree upon.

    PS: I'm missing "Reply to message" and a threaded view for lengthy topics

    917722a4-75c7-42b9-8682-1cc9e3abd141 commented 12 years ago

    I am wondering if a CVE id has been assigned to this security issue yet?