orientechnologies / orientdb

OrientDB is the most versatile DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models in one Multi-Model product. OrientDB can run distributed (Multi-Master), supports SQL, ACID Transactions, Full-Text indexing and Reactive Queries.
https://orientdb.dev
Apache License 2.0
4.74k stars 870 forks source link

ORecordId - hashCode function collisions #3903

Closed swimmesberger closed 9 years ago

swimmesberger commented 9 years ago

The hashCode function of the ORecordId class seems to not work properly:

    ORecordId rec = new ORecordId(12, 31);
    int hash = rec.hashCode();
    ORecordId rec1 = new ORecordId(13, 0);
    int hash1 = rec1.hashCode();

Both hash and hash1 returns "403" as integer value.

Version: 2.0.5

andrii0lomakin commented 9 years ago

We have com.orientechnologies.orient.core.index.hashindex.local.OMurmurHash3HashFunction hash function do not you mind to implement it as source of hash code and test how performance was changed ?

lvca commented 9 years ago

@laa do you mean implementing ORecordId.hashCode() using Murmur algorithm? If I'm right, +1

andrii0lomakin commented 9 years ago

yes, but I will really appreciate if @swimmesberger will try himself if he is interested of course.

swimmesberger commented 9 years ago
    final OMurmurHash3HashFunction<OIdentifiable> hasher = new OMurmurHash3HashFunction<OIdentifiable>();
    hasher.setValueSerializer(new OStreamSerializerRID());
    ORecordId rec = new ORecordId(12, 31);
    long hash = hasher.hashCode(rec);
    ORecordId rec1 = new ORecordId(13, 0);
    long hash1 = hasher.hashCode(rec1);

Seems to work better (no collisions so far) but not really a dropin replacement for hashCode because hashCode returns an integer value and "OHashFunction hashCode(V value)" returns a long value. (if we didn't want to discard any information)

The OMurmurHash3HashFunction is indeed "much" slower than the normal hashCode function but for my usecase the performance drop is justifiable.

In my fast non optimized test 10.000 calls to OMurmurHash3HashFunction hashCode function takes ~52ms and using the current hashCode function with 10.000 calls takes ~8ms.

lvca commented 9 years ago

Any idea to improve current ORecordID.hashCode() or can I close this issue?

swimmesberger commented 9 years ago

No idea but I just wanted to say that we can't really rely on ORecordID.hashCode() for the reason stated above so using ORecordID objects in HashMaps for example could be pretty dangerous.

doppelrittberger commented 9 years ago

Hi, I took a look into the hashCode() method of the class ORecordId. You should use this method: return 31 * clusterId + 37 * (int) (clusterPosition ^ (clusterPosition >>> 32)); instead of the old one (like proposed by Josh Bloch's Effective Java, http://stackoverflow.com/questions/113511/best-implementation-for-hashcode-method). This solves most collisions for me.

lvca commented 9 years ago

I don't think 37 could brings any advantages. Do you have any data how this number reduced your collision rate?

doppelrittberger commented 9 years ago

I made a small test:

public static void main (String[] args)
    {
        Set<Integer> oldHashCodes = new HashSet<>();
        int oldHashCodeCollisions = 0;
        for (int clusterId = 0; clusterId < 100; clusterId++) {
            for (int clusterPosition = -1000; clusterPosition < 100000; clusterPosition++) {
                int oldHashCode = oldHashCode(clusterId, clusterPosition);
                if (!oldHashCodes.add(oldHashCode)) {
                    oldHashCodeCollisions++;
                }
            }
        }
        System.out.println("Old hashCode collisions: " + oldHashCodeCollisions);
        Set<Integer> newHashCodes = new HashSet<>();
        int newHashCodeCollisions = 0;
        for (int clusterId = 0; clusterId < 100; clusterId++) {
            for (int clusterPosition = -1000; clusterPosition < 100000; clusterPosition++) {
                int newHashCode = newHashCode(clusterId, clusterPosition);
                if (!newHashCodes.add(newHashCode)) {
                    newHashCodeCollisions++;
                }
            }
        }
        System.out.println("New hashCode collisions: " + newHashCodeCollisions);
    }

    private static int oldHashCode (int clusterId, long clusterPosition)
    {
        return 31 * clusterId + (int) (clusterPosition ^ (clusterPosition >>> 32));
    }

    private static int newHashCode (int clusterId, long clusterPosition)
    {
        return 31 * clusterId + 103 * (int) clusterPosition;
    }

and the results are impressive: Old hashCode collisions: 9996931 New hashCode collisions: 0

lvca commented 9 years ago

Impressive! Why 103?

doppelrittberger commented 9 years ago

I did some testing with bigger prime numbers and 103 worked well :)

swimmesberger commented 9 years ago

Sounds very good! :) Thanks for the test! I didn't really had time to do some testing with different hashcode functions.

doppelrittberger commented 9 years ago

You´re welcome

lvca commented 9 years ago

Tested also with random RIDs and the more the RIDs are lower, the best result you have. Actually in 99% of cases this is what happen: rids are incremental. So definitely +1. @doppelrittberger Seems you found the magic number (103). Thanks ;-)

PhantomYdn commented 9 years ago

Guys, new hash will fail if you will test more than just 100:)

Simple example:

Records 103:62 and 206:31 will have the same hash code:
31*103 + 103*62 = 31*206+103*31 
because:
31*1*103+103*2*31=31*2*103+103*1*31 and after changing orders
31*1*103+103*2*31=31*1*103+103*2*31

I'm not expect in cipher, but you will definitely need in XOR : affine transformations will not definitely work.

PhantomYdn commented 9 years ago

And, btw, there is no requirement in java to have different hashcodes for different objects. But for hashcode for ORecordId I would recommend the following requirements: 1) Very low probability of impact in case of the same cluster rather than between clusters 2) Very fast in calculation

The following (autogenerated in eclipse) function much better than proposed:

@Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + clusterId;
        result = prime * result + (int) (clusterPosition ^ (clusterPosition >>> 32));
        return result;
    }
smolinari commented 9 years ago

Nothing like what you can up with Ilia, but we were looking at using this very small library for hash Ids. We are thinking more on the lines of URL ids too, which I know isn't really part of this discussion or issue.

http://hashids.org/java/

The cool thing with this hashing algorythm is, you can insert an rid and get a nice URL friendly hash Id. And, when using salt (the third input), the rid could be tokenized to an extent.

Pretty sure this isn't helpful and sorry if it isn't. But still, I thought I'd throw it into the conversation.

Scott

doppelrittberger commented 9 years ago

@PhantomYdn: you´re right but since in a typical usecase the clusterId is much smaller than the clusterPosition this simple "trick" can prevent hash collisions. And even you´re also right that hash collisions are no real issue in Java, they might impact the performance of hash based sets and maps (like analyzed here: http://www.nurkiewicz.com/2014/04/hashmap-performance-improvements-in.html). So my suggestion was just to multiple the clusterId with a bigger prime to avoid collisions. The current version simply does 31*clusterId+clusterPosition and this generates much more collisions. In addition taking the first 32 bit of a long value just cuts off the sign and causes additional collisions (since every new record gets a negative clusterPosition) -> 1:1 collides with 1:-1 and so on... One final thought: since every database can only store up to 32,767 clusters you can also multiply the clusterId with the next prime (32771) and prevent all collisions based on clusterIds:

public static void main (String[] args)
    {
        Set<Integer> oldHashCodes = new HashSet<>();
        int oldHashCodeCollisions = 0;
        for (int clusterId = 0; clusterId < 10000; clusterId++) {
            for (int clusterPosition = -1000; clusterPosition < 1000; clusterPosition++) {
                int oldHashCode = oldHashCode(clusterId, clusterPosition);
                if (!oldHashCodes.add(oldHashCode)) {
                    oldHashCodeCollisions++;
                }
            }
        }
        System.out.println("Old hashCode collisions: " + oldHashCodeCollisions);
        Set<Integer> newHashCodes = new HashSet<>();
        int newHashCodeCollisions = 0;
        for (int clusterId = 0; clusterId < 10000; clusterId++) {
            for (int clusterPosition = -1000; clusterPosition < 1000; clusterPosition++) {
                int newHashCode = newHashCode(clusterId, clusterPosition);
                if (!newHashCodes.add(newHashCode)) {
                    newHashCodeCollisions++;
                }
            }
        }
        System.out.println("New hashCode collisions: " + newHashCodeCollisions);
    }

    private static int oldHashCode (int clusterId, long clusterPosition)
    {
        return 31 * clusterId + (int) (clusterPosition ^ (clusterPosition >>> 32));
    }

    private static int newHashCode (int clusterId, long clusterPosition)
    {
        return 31 * clusterId + 32771 * (int) clusterPosition;
    }

This code prints: Old hashCode collisions: 19689031 New hashCode collisions: 0