
Caching Service #10816

Closed — torkelo closed this issue 5 years ago

torkelo commented 6 years ago

Right now in Grafana we do not have a good solution for caching. The only cache we have is a 5-second cache for data source settings (it improves speed for the dataproxy), implemented using go-cache.

Requirements

Challenges

DB Table Solution

One solution would be to have a caching service that continually checks a "message" table for invalidation messages, say every 5 seconds.

Message table

| id | key | value |
|----|------------------|--------------------|
| 5 | invalidate-cache | allowed-dashboards |
| 4 | invalidate-cache | datasources |

The idea would be to insert the invalidation messages in the same transaction that would affect the cache. So a transaction that modifies dashboard permissions would also insert a message to invalidate that cache.
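A minimal sketch of how that could look, assuming plain database/sql and the hypothetical message table above; the table, column, and key names are illustrative only:

```go
package cache

import "database/sql"

// Write path: update dashboard permissions and queue an invalidation
// message in the same transaction, so the message only becomes visible
// if the permission change actually commits.
func updateDashboardPermissions(db *sql.DB, dashboardID int64, permission int) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback()

	if _, err := tx.Exec("UPDATE dashboard_acl SET permission = ? WHERE dashboard_id = ?", permission, dashboardID); err != nil {
		return err
	}
	if _, err := tx.Exec("INSERT INTO message (key, value) VALUES (?, ?)", "invalidate-cache", "allowed-dashboards"); err != nil {
		return err
	}
	return tx.Commit()
}

// Read path: every Grafana instance polls the message table (say every
// 5 seconds) and evicts the named cache for any messages it has not seen.
func pollInvalidations(db *sql.DB, lastSeenID int64, invalidate func(cacheKey string)) (int64, error) {
	rows, err := db.Query("SELECT id, value FROM message WHERE key = ? AND id > ?", "invalidate-cache", lastSeenID)
	if err != nil {
		return lastSeenID, err
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var value string
		if err := rows.Scan(&id, &value); err != nil {
			return lastSeenID, err
		}
		invalidate(value)
		if id > lastSeenID {
			lastSeenID = id
		}
	}
	return lastSeenID, rows.Err()
}
```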

Redis

Use Redis as a pub/sub and cache data store.
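As a rough illustration only: publishing and consuming invalidations could look something like the following with the go-redis client. The library choice and channel name are assumptions, not something settled in this issue.

```go
package cache

import (
	"context"

	"github.com/go-redis/redis/v8"
)

// publishInvalidation drops the cached value and tells every other
// Grafana instance to drop its local copy as well.
func publishInvalidation(ctx context.Context, rdb *redis.Client, cacheKey string) error {
	if err := rdb.Del(ctx, cacheKey).Err(); err != nil {
		return err
	}
	return rdb.Publish(ctx, "grafana.cache.invalidate", cacheKey).Err()
}

// listenForInvalidations runs on every instance and evicts local entries
// whenever an invalidation message arrives.
func listenForInvalidations(ctx context.Context, rdb *redis.Client, invalidateLocal func(key string)) {
	sub := rdb.Subscribe(ctx, "grafana.cache.invalidate")
	defer sub.Close()

	ch := sub.Channel()
	for {
		select {
		case msg, ok := <-ch:
			if !ok {
				return
			}
			invalidateLocal(msg.Payload)
		case <-ctx.Done():
			return
		}
	}
}
```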

Integrated distributed cache

Using something like https://github.com/golang/groupcache, which is an in-process distributed cache.

bergquist commented 6 years ago

Database polling

Tying the DB transaction to the invalidation sounds like a good idea. But I think we can achieve the same goal using the bus and publishing an invalidation message after successful transactions. We can do a type assertion on the cmd and see if it implements InvalidateKeys() []string or something like that. I strongly dislike requiring sticky sessions.
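A small sketch of that idea, with simplified stand-in types rather than Grafana's actual bus package:

```go
package bus

// CacheInvalidator is the opt-in contract: commands that affect cached
// data expose the keys that should be evicted after they succeed.
type CacheInvalidator interface {
	InvalidateKeys() []string
}

// dispatch runs the handler and, only after it (and its DB transaction)
// succeeds, does a type assertion on the command and publishes evictions.
func dispatch(handler func(cmd interface{}) error, cmd interface{}, invalidate func(key string)) error {
	if err := handler(cmd); err != nil {
		return err
	}
	if inv, ok := cmd.(CacheInvalidator); ok {
		for _, key := range inv.InvalidateKeys() {
			invalidate(key)
		}
	}
	return nil
}
```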

GroupCache

Looks like a nice project, but:

> does not support versioned values. If key "foo" is value "bar", key "foo" must always be "bar". There are neither cache expiration times, nor explicit cache evictions. Thus there is also no CAS, nor Increment/Decrement. This also means that groupcache....

That makes it a bad candidate for our use case? Unless we start creating keys with timestamps and.... MESSY

Redis

Requiring Redis is a big change, but I think it's a reasonable requirement considering what we can offer in return. I also think it's on par with requiring sticky sessions (consider how many issues those will cause). Redis would also enable us to start using websockets/long polling.

woodsaj commented 6 years ago

Database polling

This is not a great option for HostedGrafana. Polling the database for changes puts unnecessary load on the database cluster. We have many thousands of instances, and having them all poll the DB looking for something that 99% of the time isn't there is not ideal.

woodsaj commented 6 years ago

go-xorm/xorm already supports caching. I definitely think we should be using that. https://godoc.org/github.com/go-xorm/xorm#Engine https://godoc.org/github.com/go-xorm/core#Cacher

For single Grafana instances, you could just use the LRU cache. https://godoc.org/github.com/go-xorm/xorm#LRUCacher

When running multiple Grafana servers, there is already a Redis cache implementation: https://github.com/go-xorm/xorm-redis-cache

I would also like to see a memcached version, as memcached is easy to deploy and run.
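For reference, a minimal sketch of what enabling xorm's LRU cacher could look like, based on the linked godoc; the function name and the 1000-record limit are illustrative:

```go
package sqlstore

import "github.com/go-xorm/xorm"

// enableQueryCache turns on xorm's built-in second-level cache for a
// single Grafana instance, backed by an in-memory store.
func enableQueryCache(engine *xorm.Engine) {
	cacher := xorm.NewLRUCacher(xorm.NewMemoryStore(), 1000)
	engine.SetDefaultCacher(cacher)
	// Caching can also be limited to selected tables, e.g.:
	// engine.MapCacher(&Dashboard{}, cacher)
}
```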

bergquist commented 6 years ago

Using caching in xorm sounds like a good idea.

What I don't like about it is that you have to clear the cache yourself sometimes.

Caution: when using Cols methods with the cache enabled, the system still returns all the columns. When using the Exec method, you should clear the cache:

```go
engine.Exec("update user set name = ? where id = ?", "xlw", 1)
engine.ClearCache(new(User))
```

It is so easy to miss clearing the cache. Requiring tests for all operations involving the ORM would make this easier to catch.

torkelo commented 6 years ago

The problem is not which cache to use but when to invalidate it in HA setups :)

torkelo commented 6 years ago

I think data source permissions will require us to solve this. Data source access is cached today (only a 5-second cache), but that 5s saves an incredible number of DB queries, as it is used in the datasource proxy & tsdb/query code paths.

But in order to also check permissions in these code paths without a DB query, we would need to cache those too.

I wish we had a distributed cache, but I am afraid to add that requirement for all HA setups. I have been trying to figure out another way to do cache invalidation in an HA system using only the DB, but it's hard.

bergquist commented 6 years ago

As I see it, there are two ways to solve this problem:

- Distributed cache

- Local cache, invalidated when the source updates (e.g. pubsub / server-to-server communication)

So. Do we prefer Availability or Consistency? :)

marefr commented 6 years ago

Given that data sources should not change that often, I think we should optimize for that and cache more aggressively than 5 seconds, for example several minutes. If each query passing through the datasource proxy is required to specify which version of the datasource to use, we can invalidate the cache earlier than X minutes when a user has refreshed the Grafana UI or similar. If we do this, plus add permissions to the datasource cache so permissions can be verified in memory, I think a local cache would be more than sufficient.
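A rough sketch of the version idea, assuming the proxy request carries the datasource version it expects; all names here are hypothetical:

```go
package dsproxy

import (
	"sync"
	"time"
)

// cachedDataSource is a hypothetical cache entry for the datasource proxy:
// the settings (and later the permissions) plus the version they were read at.
type cachedDataSource struct {
	Version  int
	Settings map[string]string
	LoadedAt time.Time
}

type dsCache struct {
	mu      sync.RWMutex
	entries map[int64]cachedDataSource
	maxAge  time.Duration // e.g. several minutes instead of 5 seconds
}

// get returns the cached datasource only if it is fresh enough and at least
// as new as the version the caller asked for; otherwise the caller falls
// back to the database and refreshes the entry.
func (c *dsCache) get(id int64, wantVersion int) (cachedDataSource, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()

	ds, ok := c.entries[id]
	if !ok || ds.Version < wantVersion || time.Since(ds.LoadedAt) > c.maxAge {
		return cachedDataSource{}, false
	}
	return ds, true
}
```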

bergquist commented 5 years ago

I suggest that we have two caches in Grafana.

The distributed cache should be a facade that can be configured to store cache data in memory, Redis, memcached, or MySQL/Postgres, defaulting to the database already configured to store dashboards so there are no breaking changes for those who run Grafana in an HA environment. Since we already supported different datastores in the previous session implementation, it would be quite easy to introduce this into Grafana.
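A sketch of what such a facade could look like, shown here with only a trivial in-memory backend; the interface name and methods are illustrative, not an existing Grafana API:

```go
package remotecache

import (
	"errors"
	"sync"
	"time"
)

// CacheStorage is the facade: one small interface that could be backed by
// in-memory, Redis, memcached, or the main SQL database.
type CacheStorage interface {
	Get(key string) ([]byte, error)
	Set(key string, value []byte, expire time.Duration) error
	Delete(key string) error
}

var ErrCacheItemNotFound = errors.New("cache item not found")

// memoryStorage is the simplest backend, suitable for single-instance setups.
type memoryStorage struct {
	mu    sync.RWMutex
	items map[string][]byte
}

func newMemoryStorage() *memoryStorage {
	return &memoryStorage{items: map[string][]byte{}}
}

func (s *memoryStorage) Get(key string) ([]byte, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.items[key]
	if !ok {
		return nil, ErrCacheItemNotFound
	}
	return v, nil
}

func (s *memoryStorage) Set(key string, value []byte, expire time.Duration) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[key] = value // expiration handling omitted for brevity in this sketch
	return nil
}

func (s *memoryStorage) Delete(key string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.items, key)
	return nil
}
```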

Having two types of caches might seem confusing, but that's something I hope we can address with naming. It's hard to say if this would solve all our use cases, but it should solve all the ones listed in this issue.