segmentio / ksuid

K-Sortable Globally Unique IDs
https://segment.com/blog/a-brief-history-of-the-uuid/
MIT License

Analysis of hash distribution of ksuids. #13

Closed lewchuk closed 7 years ago

lewchuk commented 7 years ago

Has any analysis been done of how ksuid values distribute under common hash functions? There are use cases where you want the time ordering of the ids for client-side purposes but, on the server side, use the ids for distributed storage in a clustered data store. A typical strategy in that case is to hash the ids and use the hash to pick a shard within the datastore for that entity.

Does the timestamp prefix cause hashes of ksuids to cluster under common hash functions, e.g. MD5, the various FNV or SHA algorithms, etc.?

achille-roussel commented 7 years ago

This is not something we've done, because it hasn't been a use case for us. However, if you're worried about the 4 leading timestamp bytes causing a bad distribution, you could hash only the KSUID payload (or use it directly as a 128-bit hash value if that fits your needs).

If you dig more into that I'd be interested to know what your findings are.
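
For illustration, a minimal sketch of the payload-only idea (FNV-1a and the shard count here are arbitrary choices for the example, not a recommendation):

package main

import (
    "fmt"
    "hash/fnv"

    "github.com/segmentio/ksuid"
)

func main() {
    id := ksuid.New()

    // Hash only the 16 random payload bytes, skipping the 4-byte timestamp.
    h := fnv.New32a()
    h.Write(id.Payload())

    const numShards = 16
    fmt.Printf("id=%s shard=%d\n", id, h.Sum32()%numShards)
}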

rbranson commented 7 years ago

@lewchuk -- This is a good question, but I would say that if a hashing function isn't able to create a random distribution out of KSUIDs, it is not a very good hashing function.

One of the key purposes of a cryptographic hashing function is to obscure the input data. If the distribution of the digest isn't random, one could use that to "predict" the input with a certain probability. That would be a very weak hash function, not something that would be generally accepted. Even "compromised" hashing algorithms like MD5 and SHA1 will still yield a random enough distribution for a hash table.

For non-cryptographic hash functions like FNV (which is just a multiply-and-XOR loop), a simple incrementing timestamp would be more of a concern, but still a minor one, since again the primary purpose of a hash function is to take in any kind of data and output a randomly distributed digest. Mathematically, all hashing functions will "absorb" the entropy of the input data. Therefore the 128 random bits at the tail of a KSUID should theoretically "scramble" the 32 bits of low-entropy data at the head.
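
That claim is easy to check empirically. A quick sketch (the bucket and iteration counts are arbitrary), which should show roughly uniform counts if the payload bits do scramble the timestamp prefix:

package main

import (
    "fmt"
    "hash/fnv"

    "github.com/segmentio/ksuid"
)

func main() {
    const numBuckets = 8
    counts := make([]int, numBuckets)

    // Hash full KSUIDs, timestamp prefix included, with FNV-1a.
    for i := 0; i < 1000000; i++ {
        h := fnv.New32a()
        h.Write(ksuid.New().Bytes())
        counts[h.Sum32()%numBuckets]++
    }

    for i, c := range counts {
        fmt.Printf("bucket %d: %d\n", i, c)
    }
}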

lewchuk commented 7 years ago

Thanks for the response @rbranson, that makes sense. In follow-up research for my particular situation (DynamoDB shard keys), I found that adding random suffixes to dates is actually a recommended strategy for getting a good hash distribution of shard keys.
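
For anyone unfamiliar with that strategy, a minimal sketch of a date key with a random write-sharding suffix (the suffix range and key layout are illustrative, not taken from the DynamoDB docs):

package main

import (
    "fmt"
    "math/rand"
    "time"
)

func main() {
    // Spread writes for a single day across N logical partitions by
    // appending a random suffix to the date-based key.
    const numSuffixes = 10
    date := time.Now().UTC().Format("2006-01-02")
    key := fmt.Sprintf("%s.%d", date, rand.Intn(numSuffixes))
    fmt.Println(key) // e.g. "2017-06-05.7"
}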

thestephenstanton commented 3 years ago

I know this is an old thread, but I have recently run into an issue when using Kinesis. It could be completely unrelated, but we use ksuids as the partition key for Kinesis. We had 18 shards, and in one minute we had about 4.5k write-throughput-exceeded errors on a single shard. Unfortunately, 1 minute is the lowest time granularity they'll give you.

achille-roussel commented 3 years ago

Thanks for adding content to the conversation @thestephenstanton. Did you receive feedback from AWS that using KSUIDs as partition keys caused the imbalanced distribution of records across the shards?

thestephenstanton commented 3 years ago

@achille-roussel I have not yet. AWS, in my experience, is generally not helpful with questions like this that require deep knowledge of the service. But I am also trying to get a couple more data points, since it isn't happening consistently; it has happened only twice in the past week. I will reach out to them and update this post if I have any more findings.

EDIT: Looking back, I remembered I had actually asked about this a while ago. The response was essentially that I needed retry logic, because even with a random key there is no guarantee of always getting a new shard. So there is a possibility of making one shard really hot and not being able to use it until the rate limit resets.
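
For what it's worth, the retry logic they suggested might look something like this generic sketch (the function name and backoff constants are mine, not AWS's; put stands in for any Kinesis write attempt):

package main

import (
    "errors"
    "fmt"
    "math/rand"
    "time"
)

// putWithRetry retries a failing write with jittered exponential backoff.
func putWithRetry(put func() error, maxAttempts int) error {
    backoff := 50 * time.Millisecond
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err = put(); err == nil {
            return nil
        }
        time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff))))
        backoff *= 2
    }
    return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
    attempts := 0
    err := putWithRetry(func() error {
        attempts++
        if attempts < 3 {
            return errors.New("throughput exceeded") // simulated hot-shard error
        }
        return nil
    }, 5)
    fmt.Println(attempts, err)
}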

caldempsey commented 5 months ago

@thestephenstanton Did we ever get any solid benchmarks/outcomes?

thestephenstanton commented 5 months ago

> @thestephenstanton Did we ever get any solid benchmarks/outcomes?

@caldempsey we did not (to my recollection). We ended up having so many other problems with Kinesis that we moved to Kafka. Using ksuid for the partition key there did not cause the same problem.

caldempsey commented 3 months ago

Bucket 0: 178082 KSUIDs
Bucket 1: 45305 KSUIDs
Bucket 2: 79787 KSUIDs
Bucket 3: 169645 KSUIDs
Bucket 4: 33376 KSUIDs
Bucket 5: 535945 KSUIDs
Bucket 6: 181918 KSUIDs
Bucket 7: 36933 KSUIDs
Bucket 8: 21748 KSUIDs
Bucket 9: 239968 KSUIDs
Bucket 10: 108991 KSUIDs
Bucket 11: 179883 KSUIDs
Bucket 12: 192797 KSUIDs
Bucket 13: 193389 KSUIDs
Bucket 14: 146850 KSUIDs
Bucket 15: 790831 KSUIDs
Bucket 16: 191241 KSUIDs
Bucket 17: 200194 KSUIDs
Bucket 18: 29267 KSUIDs
Bucket 19: 72021 KSUIDs
Bucket 20: 242820 KSUIDs
Bucket 21: 25388 KSUIDs
Bucket 22: 100534 KSUIDs
Bucket 23: 194944 KSUIDs
Bucket 24: 51938 KSUIDs
Bucket 25: 375328 KSUIDs
Bucket 26: 222163 KSUIDs
Bucket 27: 133024 KSUIDs
Bucket 28: 893468 KSUIDs
Bucket 29: 194236 KSUIDs
Bucket 30: 222859 KSUIDs
Bucket 31: 100536 KSUIDs
Bucket 32: 25174 KSUIDs
Bucket 33: 289273 KSUIDs
Bucket 34: 132434 KSUIDs
Bucket 35: 510787 KSUIDs
Bucket 36: 56024 KSUIDs
Bucket 37: 51856 KSUIDs
Bucket 38: 77686 KSUIDs
Bucket 39: 41334 KSUIDs
Bucket 40: 121210 KSUIDs
Bucket 41: 118417 KSUIDs
Bucket 42: 1450617 KSUIDs
Bucket 43: 46008 KSUIDs
Bucket 44: 48228 KSUIDs
Bucket 45: 113773 KSUIDs
Bucket 46: 33750 KSUIDs
Bucket 47: 277452 KSUIDs
Bucket 48: 88089 KSUIDs
Bucket 49: 132479 KSUIDs

package main

import (
    "fmt"
    "hash/crc32"
    "sort"

    "github.com/segmentio/ksuid"
)

// Node represents a bucket or shard.
type Node struct {
    ID   int
    Hash uint32
}

// ConsistentHash implements a simple consistent hashing mechanism.
type ConsistentHash struct {
    nodes Nodes
}

// Nodes is a slice of Node.
type Nodes []Node

// Len returns the length of the Nodes slice.
func (n Nodes) Len() int { return len(n) }

// Less compares the hash of two nodes.
func (n Nodes) Less(i, j int) bool { return n[i].Hash < n[j].Hash }

// Swap swaps two nodes in the Nodes slice.
func (n Nodes) Swap(i, j int) { n[i], n[j] = n[j], n[i] }

// NewConsistentHash creates a new ConsistentHash with a given number of buckets.
func NewConsistentHash(numBuckets int) *ConsistentHash {
    ch := &ConsistentHash{}
    for i := 0; i < numBuckets; i++ {
        node := Node{
            ID:   i,
            Hash: hash(fmt.Sprintf("Node-%d", i)),
        }
        ch.nodes = append(ch.nodes, node)
    }
    sort.Sort(ch.nodes)
    return ch
}

// GetNode returns the node ID for the given key.
func (ch *ConsistentHash) GetNode(key string) int {
    keyHash := hash(key)
    // Find the first node whose hash is greater than or equal to the key hash.
    for _, node := range ch.nodes {
        if keyHash <= node.Hash {
            return node.ID
        }
    }
    // Wrap around to the first node if necessary.
    return ch.nodes[0].ID
}

// hash is a helper function to hash a string using CRC32.
func hash(key string) uint32 {
    return crc32.ChecksumIEEE([]byte(key))
}

func main() {
    numBuckets := 50
    numKSUIDs := 10000000

    ch := NewConsistentHash(numBuckets)

    // Create buckets to store IDs.
    buckets := make([]int, numBuckets)

    // Generate KSUIDs and distribute them across buckets.
    for i := 0; i < numKSUIDs; i++ {
        id := ksuid.New()
        node := ch.GetNode(id.String())
        buckets[node]++
    }

    // Print the distribution of KSUIDs across buckets.
    for i, count := range buckets {
        fmt.Printf("Bucket %d: %d KSUIDs\n", i, count)
    }
}
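
One likely explanation for the skew: with a single point per bucket on the ring, the arc lengths between points vary widely, independent of what keys are hashed. The usual fix is virtual nodes. A sketch of that change, assuming the types and hash helper from the program above:

// NewConsistentHashVirtual places each bucket on the ring many times
// (replicas) so that the arc lengths even out; GetNode works unchanged.
func NewConsistentHashVirtual(numBuckets, replicas int) *ConsistentHash {
    ch := &ConsistentHash{}
    for i := 0; i < numBuckets; i++ {
        for r := 0; r < replicas; r++ {
            ch.nodes = append(ch.nodes, Node{
                ID:   i,
                Hash: hash(fmt.Sprintf("Node-%d-%d", i, r)),
            })
        }
    }
    sort.Sort(ch.nodes)
    return ch
}
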
caldempsey commented 3 months ago

cc: @thestephenstanton continuing the narrative above

caldempsey commented 3 months ago

However, when using a more naive approach:

package main

import (
    "fmt"

    "github.com/google/uuid"
    "github.com/segmentio/ksuid"
)

// GenerateUUIDBuckets generates UUIDs and distributes them into buckets using modulo.
func GenerateUUIDBuckets(numBuckets, numUUIDs int) []int {
    buckets := make([]int, numBuckets)

    for i := 0; i < numUUIDs; i++ {
        id := uuid.New()
        hash := naiveHash(id.String())
        bucket := hash % uint32(numBuckets)
        buckets[bucket]++
    }

    return buckets
}

// GenerateKSUIDBuckets generates KSUIDs and distributes them into buckets using modulo.
func GenerateKSUIDBuckets(numBuckets, numKSUIDs int) []int {
    buckets := make([]int, numBuckets)

    for i := 0; i < numKSUIDs; i++ {
        id := ksuid.New()
        hash := naiveHash(id.String())
        bucket := hash % uint32(numBuckets)
        buckets[bucket]++
    }

    return buckets
}

// naiveHash is a simple multiplicative string hash (hash = hash*31 + char).
func naiveHash(key string) uint32 {
    var hash uint32
    for _, char := range key {
        hash = hash*31 + uint32(char)
    }
    return hash
}

func main() {
    numBuckets := 50
    numUUIDs := 1000000
    numKSUIDs := 1000000

    // Generate and distribute UUIDs
    uuidBuckets := GenerateUUIDBuckets(numBuckets, numUUIDs)
    fmt.Println("UUID Distribution:")
    for i, count := range uuidBuckets {
        fmt.Printf("Bucket %d: %d UUIDs\n", i, count)
    }

    // Generate and distribute KSUIDs
    ksuidBuckets := GenerateKSUIDBuckets(numBuckets, numKSUIDs)
    fmt.Println("\nKSUID Distribution:")
    for i, count := range ksuidBuckets {
        fmt.Printf("Bucket %d: %d KSUIDs\n", i, count)
    }
}

The coefficient of variation is about the same for both (a small win for UUIDv4s):
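
(Here the coefficient of variation means stddev/mean over the bucket counts; a minimal sketch of the computation:)

package main

import (
    "fmt"
    "math"
)

// coefficientOfVariation returns stddev/mean for a slice of bucket counts.
func coefficientOfVariation(buckets []int) float64 {
    mean := 0.0
    for _, c := range buckets {
        mean += float64(c)
    }
    mean /= float64(len(buckets))

    variance := 0.0
    for _, c := range buckets {
        d := float64(c) - mean
        variance += d * d
    }
    variance /= float64(len(buckets))

    return math.Sqrt(variance) / mean
}

func main() {
    // Sample counts from the UUID run above.
    fmt.Println(coefficientOfVariation([]int{20225, 20013, 20078, 20190}))
}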

UUID Distribution:
Bucket 0: 20225 UUIDs
Bucket 1: 20013 UUIDs
Bucket 2: 20078 UUIDs
Bucket 3: 20190 UUIDs
Bucket 4: 19936 UUIDs
Bucket 5: 19760 UUIDs
Bucket 6: 19905 UUIDs
Bucket 7: 20204 UUIDs
Bucket 8: 19958 UUIDs
Bucket 9: 20031 UUIDs
Bucket 10: 20148 UUIDs
Bucket 11: 19891 UUIDs
Bucket 12: 19944 UUIDs
Bucket 13: 20178 UUIDs
Bucket 14: 20029 UUIDs
Bucket 15: 20021 UUIDs
Bucket 16: 19726 UUIDs
Bucket 17: 20043 UUIDs
Bucket 18: 19979 UUIDs
Bucket 19: 19966 UUIDs
Bucket 20: 19911 UUIDs
Bucket 21: 20070 UUIDs
Bucket 22: 20065 UUIDs
Bucket 23: 20082 UUIDs
Bucket 24: 20094 UUIDs
Bucket 25: 20121 UUIDs
Bucket 26: 19874 UUIDs
Bucket 27: 19828 UUIDs
Bucket 28: 19928 UUIDs
Bucket 29: 19813 UUIDs
Bucket 30: 19860 UUIDs
Bucket 31: 19981 UUIDs
Bucket 32: 20110 UUIDs
Bucket 33: 19986 UUIDs
Bucket 34: 19826 UUIDs
Bucket 35: 19859 UUIDs
Bucket 36: 20073 UUIDs
Bucket 37: 20025 UUIDs
Bucket 38: 20052 UUIDs
Bucket 39: 20001 UUIDs
Bucket 40: 20090 UUIDs
Bucket 41: 19953 UUIDs
Bucket 42: 19909 UUIDs
Bucket 43: 20006 UUIDs
Bucket 44: 19982 UUIDs
Bucket 45: 20061 UUIDs
Bucket 46: 19917 UUIDs
Bucket 47: 20120 UUIDs
Bucket 48: 20158 UUIDs
Bucket 49: 20020 UUIDs

KSUID Distribution:
Bucket 0: 20076 KSUIDs
Bucket 1: 19899 KSUIDs
Bucket 2: 19984 KSUIDs
Bucket 3: 19832 KSUIDs
Bucket 4: 20030 KSUIDs
Bucket 5: 20119 KSUIDs
Bucket 6: 19826 KSUIDs
Bucket 7: 19913 KSUIDs
Bucket 8: 20036 KSUIDs
Bucket 9: 19846 KSUIDs
Bucket 10: 19934 KSUIDs
Bucket 11: 19851 KSUIDs
Bucket 12: 19988 KSUIDs
Bucket 13: 19927 KSUIDs
Bucket 14: 19886 KSUIDs
Bucket 15: 19805 KSUIDs
Bucket 16: 20164 KSUIDs
Bucket 17: 19974 KSUIDs
Bucket 18: 20201 KSUIDs
Bucket 19: 20073 KSUIDs
Bucket 20: 19905 KSUIDs
Bucket 21: 19928 KSUIDs
Bucket 22: 19948 KSUIDs
Bucket 23: 20026 KSUIDs
Bucket 24: 20104 KSUIDs
Bucket 25: 19949 KSUIDs
Bucket 26: 20003 KSUIDs
Bucket 27: 19860 KSUIDs
Bucket 28: 20273 KSUIDs
Bucket 29: 20004 KSUIDs
Bucket 30: 19802 KSUIDs
Bucket 31: 20087 KSUIDs
Bucket 32: 20035 KSUIDs
Bucket 33: 20033 KSUIDs
Bucket 34: 20127 KSUIDs
Bucket 35: 20136 KSUIDs
Bucket 36: 20038 KSUIDs
Bucket 37: 19884 KSUIDs
Bucket 38: 19985 KSUIDs
Bucket 39: 19805 KSUIDs
Bucket 40: 20144 KSUIDs
Bucket 41: 20175 KSUIDs
Bucket 42: 20121 KSUIDs
Bucket 43: 20110 KSUIDs
Bucket 44: 20261 KSUIDs
Bucket 45: 20165 KSUIDs
Bucket 46: 19962 KSUIDs
Bucket 47: 19722 KSUIDs
Bucket 48: 19945 KSUIDs
Bucket 49: 20099 KSUIDs