iotaledger / wasp

Node for IOTA Smart Contracts
Apache License 2.0

In-memory Trie-cache is too slow #2485

Closed fijter closed 1 year ago

fijter commented 1 year ago

Replace the current cache we are using, golang-lru, with another solution and test it.

fijter commented 1 year ago

luke is benchmarking several libraries against our testnet database.

luke-thorne commented 1 year ago

@dessaya I put together some graphs of the data I was able to gather. I used a Zipfian-based generator to emulate a few regularly active addresses and mostly inactive addresses.
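
(For illustration only: a Zipf-skewed key generator in Go can be sketched roughly like this; the parameters and key format below are made up and are not the ones used in the actual trie_test.go benchmark.)

    // Minimal sketch of a Zipf-skewed key generator: a handful of "hot"
    // addresses are requested very often, most addresses only rarely.
    package main

    import (
        "fmt"
        "math/rand"
    )

    func main() {
        r := rand.New(rand.NewSource(42))
        // s=1.1, v=1, imax=999999: keys near 0 are drawn far more often.
        zipf := rand.NewZipf(r, 1.1, 1, 1_000_000-1)

        for i := 0; i < 5; i++ {
            key := fmt.Sprintf("address-%d", zipf.Uint64())
            fmt.Println(key)
        }
    }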

TLDR;

golang-lru and fastcache were remarkably similar. I'll be replacing golang-lru with fastcache; the other alternatives had significantly lower hit rates or higher memory usage.

ns/op

fastcache had the most consistent ns/op across the board regardless of the cache size.

Image

Hitrate

There was some variance when the caches were small, but as the caches and data sets grew, the main options converged on a 43% hit rate.

Image

Allocs/op

fastcache was the only option with no variance in allocs/op.

Image

B/op

theine used the least B/op, but the difference is negligible.

Image

dessaya commented 1 year ago

That's interesting! But does the benchmark generate concurrent accesses? The main bottleneck with golang-lru was caused by this lock: https://github.com/hashicorp/golang-lru/blob/master/lru.go#L96

https://github.com/hashicorp/golang-lru/issues/9
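
(For what it's worth, a contention-focused benchmark could be sketched roughly like the following with testing.B.RunParallel; the cache variable and its Get/Set signatures here are placeholders, not taken from the actual test file.)

    // Sketch of a parallel cache benchmark in a _test.go file
    // (imports: fmt, math/rand, testing). `cache` and its Get/Set
    // methods stand in for whichever cache library is under test.
    func BenchmarkCacheParallel(b *testing.B) {
        b.RunParallel(func(pb *testing.PB) {
            // Per-goroutine Zipf source so hot keys are hit concurrently.
            r := rand.New(rand.NewSource(rand.Int63()))
            zipf := rand.NewZipf(r, 1.1, 1, 1_000_000-1)
            for pb.Next() {
                key := []byte(fmt.Sprintf("key-%d", zipf.Uint64()))
                if _, ok := cache.Get(key); !ok {
                    cache.Set(key, []byte("value"))
                }
            }
        })
    }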

luke-thorne commented 1 year ago

It does not, but fastcache is specifically designed to scale with the available CPUs while remaining thread-safe. It also emphasizes reducing GC overhead. They have benchmarks against a few other caches on their GitHub as well.
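
A minimal usage sketch (the cache size here is arbitrary): fastcache is sized in bytes rather than entries, and Get/Set can be called from many goroutines without an external lock.

    package main

    import (
        "fmt"

        "github.com/VictoriaMetrics/fastcache"
    )

    func main() {
        // Capacity is given in bytes; fastcache shards entries internally.
        c := fastcache.New(32 * 1024 * 1024)

        c.Set([]byte("key"), []byte("value"))

        if v, ok := c.HasGet(nil, []byte("key")); ok {
            fmt.Printf("hit: %s\n", v)
        }
    }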

Yiling-J commented 1 year ago

@luke-thorne I'm the author of Theine and I think there are some misconfigurations in your hit ratio benchmark.

https://github.com/iotaledger/wasp/blob/9851d05df83b033654409f5a11ee9186bbc9b5a2/packages/trie/test/trie_test.go#L723 Here Theine uses size instead of (size * itemCost), but the Set method https://github.com/iotaledger/wasp/blob/9851d05df83b033654409f5a11ee9186bbc9b5a2/packages/trie/test/trie_test.go#L736 uses len(value) as the cost; I'm not sure if this is correct.

https://github.com/iotaledger/wasp/blob/9851d05df83b033654409f5a11ee9186bbc9b5a2/packages/trie/test/trie_test.go#LL698C12-L698C12 Ristretto's Set method uses 0 as the cost, but you don't provide a Cost function in the config. Also, NumCounters should be 10 * max cost.

https://github.com/iotaledger/wasp/blob/9851d05df83b033654409f5a11ee9186bbc9b5a2/packages/trie/test/trie_test.go#L652 I think itemCost should be 40, not 40 * 4?
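
To make the point concrete, a configuration where capacity is expressed in items (so every entry costs 1) could look roughly like this; the size value is illustrative, and the snippet assumes the theine and ristretto APIs already used in trie_test.go:

    // Illustrative only: capacity in items, so every Set uses cost 1.
    size := 100_000

    theineCache, _ := theine.NewBuilder[string, []byte](int64(size)).Build()
    theineCache.Set("key", []byte("value"), 1) // cost 1 per item

    ristrettoCache, _ := ristretto.NewCache(&ristretto.Config{
        NumCounters: int64(size) * 10, // ~10x the expected number of items
        MaxCost:     int64(size),      // same unit as the per-item cost below
        BufferItems: 64,
    })
    ristrettoCache.Set("key", []byte("value"), 1) // cost 1 per item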

Also, fastcache's 10000/100000 hit ratio is extremely high compared to the others; maybe fastcache doesn't evict items while memory usage is low.

Here are the benchmark results using my own benchmark tool with the same ScrambledZipfian generator, run 10,000,000 times:

--- theine 1000000 hit ratio: 0.506
--- ristretto 1000000 hit ratio: 0.400
--- lru 1000000 hit ratio: 0.494
--- 2q 1000000 hit ratio: 0.508
--- arc 1000000 hit ratio: 0.508
--- fastcache 1000000 hit ratio: 0.481

Yiling-J commented 1 year ago

FYI, I also tested the DS1 workload using my go-cache-benchmark-plus; these are the results:

--- theine 8000000 hit ratio: 0.709
--- ristretto 8000000 hit ratio: 0.119
--- lru 8000000 hit ratio: 0.430
--- 2q 8000000 hit ratio: 0.574
--- arc 8000000 hit ratio: 0.521
--- fastcache 8000000 hit ratio: 0.224

I think it would be better to benchmark hit ratios against a real trace, or to run various types of workloads rather than Zipf only, if the hit ratio is important.

luke-thorne commented 1 year ago

@Yiling-J thanks for the advice! I tried running the benchmark again but ended up getting ~600-700 B/op for theine when setting the size as size * itemCost with itemCost = 40. How would you recommend setting the size parameter when storing a predictably sized value like fmt.Sprintf("just a bunch of dummy text for key %s", key), and how would you recommend setting it for larger, less predictable value sizes?

I did get the 43% hit rate though, so that aligned with the expected value.

Yiling-J commented 1 year ago

@luke-thorne Theine has a probabilistic data structure which is created when the cache is initialized, and B/op takes this part into account as well. If you force the benchmark to run 10 million times with -benchtime, the B/op will look normal. But there is actually a bug I found while debugging this: I created the data structure based on the total cost, not the total item count, which wastes a lot of memory. Can you run with 0.2.7 again and take a look? (Don't change benchtime; the data structure is much smaller after the fix.)
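
(For reference, the iteration count can be pinned with Go's -benchtime flag in its Nx form; the package path below is the test directory linked above, and -bench=. is just a catch-all pattern:)

    go test -bench=. -benchtime=10000000x ./packages/trie/test/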

About the size: you can set size to the maximum item count you want to store, but then the cost param in the Set method should be 1. Or, for the second case, you can use a dynamic cost and limit the total cost of the cache.
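
Roughly, the two modes could look like this (the capacities are illustrative, using the same Builder/Set API as in your snippet below):

    // Mode 1: capacity as a max item count; every entry costs 1.
    itemCache, _ := theine.NewBuilder[string, []byte](100_000).Build()
    itemCache.Set("key", []byte("value"), 1)

    // Mode 2: capacity as a total byte budget; each entry costs its value size.
    byteCache, _ := theine.NewBuilder[string, []byte](512 * 1024 * 1024).Build()
    value := []byte("some value of unpredictable size")
    byteCache.Set("key", value, int64(len(value)))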

luke-thorne commented 1 year ago

@Yiling-J Thanks! That brought almost every metric up to the top of the leaderboard! I'm seeing a spike in ns/op, though. I'm just using 1 for the item cost for now. Any chance you can spot something I misconfigured?

        "theine": {
            Init: func(size int) {
                theineCache, _ = theine.NewBuilder[string, []byte](int64(size)).Build()
            },
            Close: func() {
                theineCache.Close()
                theineCache = nil
            },
            Get: func(key []byte) bool {
                if _, ok := theineCache.Get(string(key)); ok {
                    return true
                }
                return false
            },
            Set: func(key, value []byte) bool {
                return theineCache.Set(string(key), value, 1)
            },
        },
Image

Yiling-J commented 1 year ago

@luke-thorne Your config looks good. The spike you see comes from the implementation of Theine: Theine stores data in a sharded map, and each shard has a small deque that improves write performance. The size of the deque is cache size / 100 / shard count. When your cache size is 10000, the deque size in each shard is actually near 0.
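
(As a rough worked example, with a hypothetical shard count of 64: 10000 / 100 / 64 ≈ 1, so each shard's deque can hold only about one entry, while a cache size of 1,000,000 would give roughly 1000000 / 100 / 64 ≈ 156 entries per shard.)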