rurban / smhasher

Hash function quality and speed tests
https://rurban.github.io/smhasher/
Other
1.85k stars 178 forks source link

Maybe a test for a new generation of wyhash-like functions is necessary? #177

Closed avaneev closed 3 years ago

avaneev commented 3 years ago

I've recently studied wyhash. Right now wyhash processes ("mixes") message bi-directionally (from start, and from end). While it does work as it is, passes SMHasher tests, its structure "breaks" on Permutation tests if the message is processed sequentially instead. In sequential mode, the structure only allows mixing of a single 64-bit message word, if two 64-bit message words are mixed per round, correlations arise. As I'm developing PRVHASH myself, I've found similarities with wyhash - the functions are quite different, but have similar issues. This is the main reason I'm unable to increase my function's performance further by a factor of 2 - this would require bi-directional processing, which makes streamed hashing impossible. (I'm not quite sure what's the purpose of high-speed hash function if it can't hash files in a streamed mode - I do not think anyone is going to keep multi-gigabyte files in memory).

dumblob commented 3 years ago

@wangyi-fudan thoughts?

wangyi-fudan commented 3 years ago

bi-direction? how you get this feeling. wyhash is sequential from the begining to the end. when it reaches unaligned end, it finish it with a end-aligned read.

avaneev commented 3 years ago

bi-direction? how you get this feeling. wyhash is sequential from the begining to the end. when it reaches unaligned end, it finish it with a end-aligned read.

I've meant such operations: { a=_wyr8(p); b=_wyr8(p+len-8); }.

avaneev commented 3 years ago

From my own tests, such construct fails with Permutation test: seed=_wymix(_wyr8(p)^secret[1],_wyr8(p+8)^seed);.

avaneev commented 3 years ago

@wangyi-fudan You are handling non-multiple of 8 buffer sizes rather originally, and maybe you are right with your approach. But technically you have data overlap in some cases.

wangyi-fudan commented 3 years ago

yes, there is data overlap. however, the overlaped data are XORed with different mask. so, actually, there is no problem

avaneev commented 3 years ago

yes, there is data overlap. however, the overlaped data are XORed with different mask. so, actually, there is no problem

There's one thing I've noted as well, what will happen if i==4 here (and I assume this may happen if len==60):

while(_unlikely_(i>16)){    seed=_wymix(_wyr8(p)^secret[1],_wyr8(p+8)^seed);    i-=16; p+=16;   }
    a=_wyr8(p+i-16); b=_wyr8(p+i-8);

Will it be a buffer overrun?

wangyi-fudan commented 3 years ago

since this code happed in len>16 branch. it is safe to back-read 16 bytes.

avaneev commented 3 years ago

since this code happed in len>16 branch. it is safe to back-read 16 bytes.

b=_wyr8(p+i-8); will overrun if i==4.

avaneev commented 3 years ago

As far as my original post is concerned, more specifically, correlations may be detected in wyhash-like structure if elements of the Permutation data set are appended with zeroes, so that the hash function runs for several "idle" rounds. At least it was so in my tests, I may be wrong, of course.

avaneev commented 3 years ago

The reason wyhash-like structures are problematic is that they perform "compression" after multiplication: m=A*B; return m^m>>64. Loss of state information. They work great only if input has the same length as state variable: in this case the provided entropy is dispersed back almost completely, but still is subject to occassional loss, especially apparent with 16-bit state variables.

wangyi-fudan commented 3 years ago

the problem of wyhash is that it has no problem ;-)

avaneev commented 3 years ago

the problem of wyhash is that it has no problem ;-)

Well, maybe, as you wish. Anyway, it's not usable for file hashing in its current form. Strangely enough, Reini Urban advertises fast hash functions as useful for file hashing without even checking if they have the necessary interface.

avaneev commented 3 years ago

I'm closing this topic. Sorry for disturbance. Maybe 128-bit multiplication does allow to mix two 64-bit messages, need to do more tests.

avaneev commented 3 years ago

Anyway, from the point of view of "message shuffling", a construct like m=A*B; return m^m>>64 used in wyhash is not ideal. You may compare it to my inner shuffler which produces an ideal remapping without holes. For certain, if 2 message words will be shuffled, it will be a complete loss of information.

#include <stdio.h>
#include <stdint.h>

const int shuf_bits = 10;
const int shuf = 1 << shuf_bits;
const int mask = shuf - 1;

uint32_t mulfn( uint32_t Seed, uint32_t Seed2, uint32_t m )
{
/*  Seed2 ^= m;
    Seed += Seed2;
    Seed &= mask;
    Seed *= ( Seed2 - ~Seed2 ) & mask;
*/
    uint32_t mv = Seed ^ m;
    mv *= Seed2;
    Seed = mv ^ mv >> shuf_bits;

    return( Seed & mask );
}

int main()
{
    uint64_t HitCount = 0;
    uint32_t Seed;
    uint32_t Seed2;
    uint32_t m;

    for( Seed = 0; Seed < shuf; Seed++ )
    {
        for( Seed2 = 0; Seed2 < shuf; Seed2++ )
        {
            uint8_t HitBuf[ shuf ] = { 0 };

            for( m = 0; m < shuf; m++ )
            {
                uint32_t v = mulfn( Seed, Seed2, m );

                if( HitBuf[ v ] == 0 )
                {
                    HitBuf[ v ] = 1;
                    HitCount++;
                }
            }
        }
    }

    printf( "Hits: %llu/%llu\n", HitCount, (uint64_t) shuf * shuf * shuf );
}