Closed torkelo closed 5 years ago
Database polling
Tying the DB transaction to the invalidation sounds like a good idea. But I think we can achieve the same goal using the bus and publishing an invalidation message after successful transactions. We could do a type assertion on the cmd to see if it implements `InvalidateKeys() []string`, or something like that.
I strongly dislike requiring sticky sessions.
GroupCache

Looks like a nice project, but (quoting its docs):
> does not support versioned values. If key "foo" is value "bar", key "foo" must always be "bar". There are neither cache expiration times, nor explicit cache evictions. Thus there is also no CAS, nor Increment/Decrement. This also means that groupcache....
That makes it a bad candidate for our use case? Unless we create keys with timestamps and.... MESSY
Redis

Requiring Redis is a big change, but I think it's a reasonable requirement considering what we can offer in return. I also think it's on par with requiring sticky sessions (consider how many issues those will cause). Redis would also enable us to start using websockets/long polling.
Database polling
This is not a great option for HostedGrafana. Polling the database for changes puts unnecessary load on the database cluster. We have many thousands of instances, and having them all poll the DB looking for something that 99% of the time isn't there is not ideal.
go-xorm/xorm already supports caching. I definitely think we should be using that. https://godoc.org/github.com/go-xorm/xorm#Engine https://godoc.org/github.com/go-xorm/core#Cacher
For single Grafana instances, you could just use the LRU cache. https://godoc.org/github.com/go-xorm/xorm#LRUCacher
When running multiple Grafana servers there is already a Redis cache implementation: https://github.com/go-xorm/xorm-redis-cache
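As a configuration sketch (not a full program; the driver, DSN, and cache size are placeholders, see the godoc links above), wiring up xorm's built-in LRU cacher looks roughly like this:

```go
import (
	_ "github.com/go-sql-driver/mysql"
	"github.com/go-xorm/xorm"
)

func newEngine(dsn string) (*xorm.Engine, error) {
	engine, err := xorm.NewEngine("mysql", dsn)
	if err != nil {
		return nil, err
	}
	// Cache up to 1000 records per table in memory, evicting
	// least-recently-used entries first.
	cacher := xorm.NewLRUCacher(xorm.NewMemoryStore(), 1000)
	engine.SetDefaultCacher(cacher)
	return engine, nil
}
```

For the multi-server case, the memory store and LRU cacher would be swapped for the Redis-backed implementation linked above.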
I would also like to see a Memcache version as well, as memcache is easy to deploy and run.
Using caching in xorm sounds like a good idea.
What I don't like about it is that you have to clear the cache manually sometimes.
> Caution: when the Cols methods are used with cache enabled, the system still returns all the columns. When using the Exec method, you should clear the cache:

```go
engine.Exec("update user set name = ? where id = ?", "xlw", 1)
engine.ClearCache(new(User))
```
So it's easy to miss clearing the cache. Requiring tests for all operations involving the ORM would make this easier to catch.
The problem is not which cache to use but when to invalidate it in HA setups :)
I think data source permissions will require us to solve this. Data source access is cached today (only a 5 second cache), but that 5s saves an incredible number of DB queries, since it is used in the datasource proxy & tsdb/query code paths.
But in order to also check permissions in these code paths without a DB query, we would need to cache those as well.
I wish we had a distributed cache, but I'm wary of adding that requirement for all HA setups. I've been trying to figure out another way to do cache invalidation in an HA system using only the DB, but it's hard.
As I see it, there are two ways to solve this problem.
So: do we prefer Availability or Consistency? :)
Given that datasources should not change that often, I think we should optimize for that and cache more aggressively than 5 seconds, for example several minutes. If each query passing through the datasource proxy required a version of the datasource to be used, we could invalidate the cache earlier than X minutes when a user has refreshed the Grafana UI or similar. If we do this, plus add permissions to the datasource cache so that permissions can be verified in memory, I think a local cache would be more than sufficient.
I suggest that we have two caches in Grafana
The distributed cache should be a facade that can be configured to store cache data in memory, Redis, Memcached, or MySQL/Postgres, defaulting to the database already configured to store dashboards, to avoid breaking changes for those who run Grafana in an HA environment. Since we already support different datastores in the previous session implementation, it would be quite easy to introduce this into Grafana.
Having two types of caches might seem confusing, but that's something I hope we can address with naming. It's hard to say whether this would solve all our use cases, but it should solve all the use cases listed in this issue.
Right now in Grafana we do not have a good solution for caching. The only cache we have is a 5 second cache for data source settings (it improves speed for the dataproxy), using go-cache.
Requirements
Challenges
DB Table Solution
One solution would be to have a caching service that continually checks a "message" table for invalidation messages, say every 5 seconds.
Message table

| id | key              | value              |
|----|------------------|--------------------|
| 5  | invalidate-cache | allowed-dashboards |
| 4  | invalidate-cache | datasources        |
The idea would be to insert the invalidation messages in the same transaction that affects the cached data. So a transaction that modified dashboard permissions would also insert a message to invalidate that cache.
Redis
Use redis as pubsub & cache data store.
Integrated distributed cache
Using something like https://github.com/golang/groupcache, which is an in-process distributed cache.