
Caching Service #10816

Closed — torkelo closed this issue 5 years ago

torkelo commented 6 years ago

Right now in Grafana we do not have a good solution for caching. The only cache we have is a 5-second cache for data source settings (it improves speed for the dataproxy), implemented using go-cache.

Requirements

Challenges

DB Table Solution

One solution would be to have a caching service that continually checks a "message" table for invalidation messages, say every 5 seconds.

Message table

| id | key | value |
|----|------------------|--------------------|
| 5 | invalidate-cache | allowed-dashboards |
| 4 | invalidate-cache | datasources |

The idea would be to insert the invalidation messages in the same transaction that would affect the cache. So a transaction that modifies dashboard permissions would also insert a message to invalidate that cache.
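A minimal sketch of how that could look, assuming plain database/sql and the hypothetical message table above; the table, column, and key names are illustrative only:

```go
package cache

import "database/sql"

// Write path: update dashboard permissions and queue an invalidation
// message in the same transaction, so the message only becomes visible
// if the permission change actually commits.
func updateDashboardPermissions(db *sql.DB, dashboardID int64, permission int) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback()

	if _, err := tx.Exec("UPDATE dashboard_acl SET permission = ? WHERE dashboard_id = ?", permission, dashboardID); err != nil {
		return err
	}
	if _, err := tx.Exec("INSERT INTO message (key, value) VALUES (?, ?)", "invalidate-cache", "allowed-dashboards"); err != nil {
		return err
	}
	return tx.Commit()
}

// Read path: every Grafana instance polls the message table (say every
// 5 seconds) and evicts the named cache for any messages it has not seen.
func pollInvalidations(db *sql.DB, lastSeenID int64, invalidate func(cacheKey string)) (int64, error) {
	rows, err := db.Query("SELECT id, value FROM message WHERE key = ? AND id > ?", "invalidate-cache", lastSeenID)
	if err != nil {
		return lastSeenID, err
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var value string
		if err := rows.Scan(&id, &value); err != nil {
			return lastSeenID, err
		}
		invalidate(value)
		if id > lastSeenID {
			lastSeenID = id
		}
	}
	return lastSeenID, rows.Err()
}
```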

Redis

Use Redis as a pub/sub and cache data store.
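As a rough illustration only: publishing and consuming invalidations could look something like the following with the go-redis client. The library choice and channel name are assumptions, not something settled in this issue.

```go
package cache

import (
	"context"

	"github.com/go-redis/redis/v8"
)

// publishInvalidation drops the cached value and tells every other
// Grafana instance to drop its local copy as well.
func publishInvalidation(ctx context.Context, rdb *redis.Client, cacheKey string) error {
	if err := rdb.Del(ctx, cacheKey).Err(); err != nil {
		return err
	}
	return rdb.Publish(ctx, "grafana.cache.invalidate", cacheKey).Err()
}

// listenForInvalidations runs on every instance and evicts local entries
// whenever an invalidation message arrives.
func listenForInvalidations(ctx context.Context, rdb *redis.Client, invalidateLocal func(key string)) {
	sub := rdb.Subscribe(ctx, "grafana.cache.invalidate")
	defer sub.Close()

	ch := sub.Channel()
	for {
		select {
		case msg, ok := <-ch:
			if !ok {
				return
			}
			invalidateLocal(msg.Payload)
		case <-ctx.Done():
			return
		}
	}
}
```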

Integrated distributed cache

Using something like https://github.com/golang/groupcache, which is an in-process distributed cache.

bergquist commented 6 years ago

Database polling

Tying the DB transaction to the invalidation sounds like a good idea. But I think we can achieve the same goal using the bus and publishing an invalidation message after successful transactions. We can do a type assertion on the cmd and see if it implements InvalidateKeys() []string or something like that. I strongly dislike requiring sticky sessions.
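A small sketch of that idea, with simplified stand-in types rather than Grafana's actual bus package:

```go
package bus

// CacheInvalidator is the opt-in contract: commands that affect cached
// data expose the keys that should be evicted after they succeed.
type CacheInvalidator interface {
	InvalidateKeys() []string
}

// dispatch runs the handler and, only after it (and its DB transaction)
// succeeds, does a type assertion on the command and publishes evictions.
func dispatch(handler func(cmd interface{}) error, cmd interface{}, invalidate func(key string)) error {
	if err := handler(cmd); err != nil {
		return err
	}
	if inv, ok := cmd.(CacheInvalidator); ok {
		for _, key := range inv.InvalidateKeys() {
			invalidate(key)
		}
	}
	return nil
}
```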

GroupCache

Looks like a nice project, but:

> does not support versioned values. If key "foo" is value "bar", key "foo" must always be "bar". There are neither cache expiration times, nor explicit cache evictions. Thus there is also no CAS, nor Increment/Decrement. This also means that groupcache....

That makes it a bad candidate for our use case? Unless we start creating keys with timestamps and.... MESSY

Redis

Requiring Redis is a big change, but I think it's a reasonable requirement considering what we can offer in return. I also think it's on par with requiring sticky sessions (consider how many issues those will cause). Redis would also enable us to start using websockets/long polling.

woodsaj commented 6 years ago

Database polling

This is not a great option for HostedGrafana. Polling the database for changes puts unnecessary load on the database cluster. We have many thousands of instances, and having them all poll the DB looking for something that 99% of the time isn't there is not ideal.

woodsaj commented 6 years ago

go-xorm/xorm already supports caching. I definitely think we should be using that. https://godoc.org/github.com/go-xorm/xorm#Engine https://godoc.org/github.com/go-xorm/core#Cacher

For single Grafana instances, you could just use the LRU cache. https://godoc.org/github.com/go-xorm/xorm#LRUCacher

When running multiple Grafana servers, there is already a Redis cache implementation: https://github.com/go-xorm/xorm-redis-cache

I would also like to see a memcached version, as memcached is easy to deploy and run.
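For reference, a minimal sketch of what enabling xorm's LRU cacher could look like, based on the linked godoc; the function name and the 1000-record limit are illustrative:

```go
package sqlstore

import "github.com/go-xorm/xorm"

// enableQueryCache turns on xorm's built-in second-level cache for a
// single Grafana instance, backed by an in-memory store.
func enableQueryCache(engine *xorm.Engine) {
	cacher := xorm.NewLRUCacher(xorm.NewMemoryStore(), 1000)
	engine.SetDefaultCacher(cacher)
	// Caching can also be limited to selected tables, e.g.:
	// engine.MapCacher(&Dashboard{}, cacher)
}
```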

bergquist commented 6 years ago

Using caching in xorm sounds like a good idea.

What I don't like about it is that you have to clear the cache yourself sometimes.

Caution: when using Cols methods with the cache enabled, the system still returns all the columns. When using the Exec method, you should clear the cache:

```go
engine.Exec("update user set name = ? where id = ?", "xlw", 1)
engine.ClearCache(new(User))
```

It is so easy to miss clearing the cache. Requiring tests for all operations involving the ORM would make this easier to catch.

torkelo commented 6 years ago

The problem is not which cache to use but when to invalidate it in HA setups :)

torkelo commented 6 years ago

I think data source permissions will require us to solve this. Data source access is cached today (only a 5-second cache), but that 5s saves an incredible number of DB queries, as it is used in the datasource proxy & tsdb/query code paths.

But in order to also check permissions in these code paths without a DB query, we would need to cache those too.

I wish we had a distributed cache, but I am afraid to add that requirement for all HA setups. I have been trying to figure out another way to do cache invalidation in an HA system using only the DB, but it's hard.

bergquist commented 6 years ago

As I see it, there are two ways to solve this problem:

- Distributed cache

- Local cache, invalidated when the source updates (e.g. pubsub / server-to-server communication)

So. Do we prefer Availability or Consistency? :)

marefr commented 6 years ago

Given that data sources should not change that often, I think we should optimize for that and cache more aggressively than 5 seconds, for example several minutes. If each query passing through the datasource proxy is required to specify which version of the datasource to use, we can invalidate the cache earlier than X minutes when a user has refreshed the Grafana UI or similar. If we do this, plus add permissions to the datasource cache so permissions can be verified in memory, I think a local cache would be more than sufficient.
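A rough sketch of the version idea, assuming the proxy request carries the datasource version it expects; all names here are hypothetical:

```go
package dsproxy

import (
	"sync"
	"time"
)

// cachedDataSource is a hypothetical cache entry for the datasource proxy:
// the settings (and later the permissions) plus the version they were read at.
type cachedDataSource struct {
	Version  int
	Settings map[string]string
	LoadedAt time.Time
}

type dsCache struct {
	mu      sync.RWMutex
	entries map[int64]cachedDataSource
	maxAge  time.Duration // e.g. several minutes instead of 5 seconds
}

// get returns the cached datasource only if it is fresh enough and at least
// as new as the version the caller asked for; otherwise the caller falls
// back to the database and refreshes the entry.
func (c *dsCache) get(id int64, wantVersion int) (cachedDataSource, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()

	ds, ok := c.entries[id]
	if !ok || ds.Version < wantVersion || time.Since(ds.LoadedAt) > c.maxAge {
		return cachedDataSource{}, false
	}
	return ds, true
}
```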

bergquist commented 5 years ago

I suggest that we have two caches in Grafana.

The distributed cache should be a facade that can be configured to store cache data in memory, Redis, memcached, or MySQL/Postgres, defaulting to the database already configured to store dashboards so there are no breaking changes for those who run Grafana in an HA environment. Since we already supported different datastores in the previous session implementation, it would be quite easy to introduce this into Grafana.
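A sketch of what such a facade could look like, shown here with only a trivial in-memory backend; the interface name and methods are illustrative, not an existing Grafana API:

```go
package remotecache

import (
	"errors"
	"sync"
	"time"
)

// CacheStorage is the facade: one small interface that could be backed by
// in-memory, Redis, memcached, or the main SQL database.
type CacheStorage interface {
	Get(key string) ([]byte, error)
	Set(key string, value []byte, expire time.Duration) error
	Delete(key string) error
}

var ErrCacheItemNotFound = errors.New("cache item not found")

// memoryStorage is the simplest backend, suitable for single-instance setups.
type memoryStorage struct {
	mu    sync.RWMutex
	items map[string][]byte
}

func newMemoryStorage() *memoryStorage {
	return &memoryStorage{items: map[string][]byte{}}
}

func (s *memoryStorage) Get(key string) ([]byte, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.items[key]
	if !ok {
		return nil, ErrCacheItemNotFound
	}
	return v, nil
}

func (s *memoryStorage) Set(key string, value []byte, expire time.Duration) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[key] = value // expiration handling omitted for brevity in this sketch
	return nil
}

func (s *memoryStorage) Delete(key string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.items, key)
	return nil
}
```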

Having two types of caches might seem confusing, but that's something I hope we can address with naming. It's hard to say if this would solve all our use cases, but it should solve all the ones listed in this issue.