Context: Go-Carbon supports real-time indexing for the trie index in Carbonserver. If the realtime-index parameter in the config is > 0, it creates a dedicated channel of that size for new metrics. cache.go then pushes a metric onto that channel whenever the metric is not found in the cache. Carbonserver consumes the channel during a file scan and populates the trie index from it, creating index entries for the metric even if its whisper file doesn't exist yet.
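To make the flow above concrete, here is a minimal sketch of the cache-miss path. It is not the actual cache.go code: the `cache` type, `newCache`, `add`, and `drain` are hypothetical names, and the non-blocking send (dropping the metric when the channel is full) is an assumption of this sketch.

```go
package main

import "fmt"

// cache is a hypothetical stand-in for the go-carbon cache; only the
// parts needed to show the new-metric channel are modelled here.
type cache struct {
	points     map[string][]float64
	newMetrics chan string // nil when real-time indexing is disabled
}

// newCache creates the channel only when realtimeIndex > 0, using the
// value as the channel capacity.
func newCache(realtimeIndex int) *cache {
	c := &cache{points: make(map[string][]float64)}
	if realtimeIndex > 0 {
		c.newMetrics = make(chan string, realtimeIndex)
	}
	return c
}

// add stores a point. On a cache miss it also reports the metric name
// on the new-metric channel.
func (c *cache) add(metric string, value float64) {
	if _, ok := c.points[metric]; !ok && c.newMetrics != nil {
		select {
		case c.newMetrics <- metric:
		default: // channel full: drop the name (assumption of this sketch)
		}
	}
	c.points[metric] = append(c.points[metric], value)
}

// drain models the consumer side: during a scan it takes every name that
// is currently queued and would add each one to the index.
func drain(ch chan string) []string {
	var names []string
	for {
		select {
		case m := <-ch:
			names = append(names, m)
		default:
			return names
		}
	}
}

func main() {
	c := newCache(16)
	c.add("a.b.c", 1) // miss: queued as new
	c.add("a.b.c", 2) // hit: not queued
	c.add("x.y.z", 3) // miss: queued as new
	fmt.Println(drain(c.newMetrics)) // [a.b.c x.y.z]
}
```

With an empty cache, every first point for a metric takes the miss branch, so the consumer has as many names to drain as there are distinct incoming metrics.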
Problem: The cache turns out to be a poor predictor of new metrics. When the cache is empty and there is a lot of incoming traffic, every incoming metric is a cache miss and gets queued as new, so the file scan thread is blocked for a long time and the scan never finishes.
Solution: To detect previously seen metrics we can use a simpler structure than the cache (a map). Bloom filters are a good fit for this and use bounded space. I use cuckoo filters instead, which are faster and support deletion.
I added cuckoo filter support to cache.go, with tests. I also added a bloom-size parameter to the cache config. If it is > 0, the cache uses a filter of the specified size to detect new metrics. I also delete a metric from the filter when it leaves the cache. I'm not sure we need this, but it might help with long uptimes.
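For readers unfamiliar with the data structure, here is a small, self-contained sketch of a cuckoo filter with lookup, insert, and delete. It is not the code or the library used in this PR. All names (`cuckoo`, `newCuckoo`, `isNew`) and the parameters (4 slots per bucket, 8-bit fingerprints, 500 kicks) are illustrative assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

const (
	slots    = 4   // fingerprints per bucket (illustrative)
	maxKicks = 500 // relocation attempts before giving up (illustrative)
)

// cuckoo stores 8-bit fingerprints; 0 marks an empty slot.
type cuckoo struct {
	buckets [][slots]byte
	mask    uint32
}

// newCuckoo allocates 2^pow buckets.
func newCuckoo(pow uint) *cuckoo {
	n := uint32(1) << pow
	return &cuckoo{buckets: make([][slots]byte, n), mask: n - 1}
}

func hash32(b []byte) uint32 {
	h := fnv.New32a()
	h.Write(b)
	return h.Sum32()
}

// alt returns the other candidate bucket for a fingerprint. Applying it
// twice gives back the original index.
func (c *cuckoo) alt(i uint32, fp byte) uint32 {
	return (i ^ hash32([]byte{fp})) & c.mask
}

// locate derives the non-zero fingerprint and both candidate buckets.
func (c *cuckoo) locate(key string) (fp byte, i1, i2 uint32) {
	h := hash32([]byte(key))
	fp = byte(h >> 24)
	if fp == 0 {
		fp = 1
	}
	i1 = h & c.mask
	return fp, i1, c.alt(i1, fp)
}

// put writes fp into the first empty slot of bucket i, if there is one.
func (c *cuckoo) put(i uint32, fp byte) bool {
	for s := range c.buckets[i] {
		if c.buckets[i][s] == 0 {
			c.buckets[i][s] = fp
			return true
		}
	}
	return false
}

func (c *cuckoo) has(i uint32, fp byte) bool {
	for _, v := range c.buckets[i] {
		if v == fp {
			return true
		}
	}
	return false
}

// Lookup reports whether key is probably present (false positives are possible).
func (c *cuckoo) Lookup(key string) bool {
	fp, i1, i2 := c.locate(key)
	return c.has(i1, fp) || c.has(i2, fp)
}

// Insert adds key, evicting and relocating fingerprints when both buckets
// are full. It returns false if the filter is too full; in that case the
// last evicted fingerprint is lost.
func (c *cuckoo) Insert(key string) bool {
	fp, i1, i2 := c.locate(key)
	if c.put(i1, fp) || c.put(i2, fp) {
		return true
	}
	i := i1
	for k := 0; k < maxKicks; k++ {
		s := rand.Intn(slots)
		fp, c.buckets[i][s] = c.buckets[i][s], fp // swap with a random victim
		i = c.alt(i, fp)
		if c.put(i, fp) {
			return true
		}
	}
	return false
}

// Delete removes one copy of key's fingerprint, if any.
func (c *cuckoo) Delete(key string) bool {
	fp, i1, i2 := c.locate(key)
	for _, i := range []uint32{i1, i2} {
		for s := range c.buckets[i] {
			if c.buckets[i][s] == fp {
				c.buckets[i][s] = 0
				return true
			}
		}
	}
	return false
}

// isNew reports whether metric has not been seen yet, recording it if so.
func (c *cuckoo) isNew(metric string) bool {
	if c.Lookup(metric) {
		return false
	}
	c.Insert(metric)
	return true
}

func main() {
	f := newCuckoo(10)
	fmt.Println(f.isNew("a.b.c")) // true: first sighting
	fmt.Println(f.isNew("a.b.c")) // false: already recorded
	f.Delete("a.b.c")             // metric left the cache
	fmt.Println(f.isNew("a.b.c")) // true: treated as new again
}
```

Because the filter holds fixed-size fingerprints rather than metric names, its memory use is bounded by the bucket count chosen up front. The trade-off is that a lookup can report a metric as seen when it was not.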